Added collection of used tables from Python notebooks and files and SQL queries #2772
Not marked as draft in order to run integration tests

    @dataclass
    class TableInfo(SourceInfo):
Suggested change:

    - class TableInfo(SourceInfo):
    + class UsedTable(SourceInfo):

we already have table info - https://databricks-sdk-py.readthedocs.io/en/latest/dbdataclasses/catalog.html#databricks.sdk.service.catalog.TableInfo
done

    catalog_name: str = SourceInfo.UNKNOWN
    schema_name: str = SourceInfo.UNKNOWN
    table_name: str = SourceInfo.UNKNOWN
can we also add `is_read` and `is_write`?
Done. Populated for SQL. For Python calls, I suggest doing it in a separate PR since it's a lot of work.
separate PR works
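For reference, the record discussed in this thread could look roughly like the sketch below. It is not the code merged in the PR: the `SourceInfo` base class here is a stand-in that models only an `UNKNOWN` constant, and the value of that constant and the `False` defaults for `is_read` and `is_write` are assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class SourceInfo:
    """Stand-in for the real base class; only the UNKNOWN marker is modeled."""

    UNKNOWN = "unknown"  # assumed sentinel value, not taken from the PR


@dataclass
class UsedTable(SourceInfo):
    catalog_name: str = SourceInfo.UNKNOWN
    schema_name: str = SourceInfo.UNKNOWN
    table_name: str = SourceInfo.UNKNOWN
    is_read: bool = False  # assumed default
    is_write: bool = False  # assumed default


# A table that is only read from; the names are made up for the example.
table = UsedTable(
    catalog_name="hive_metastore",
    schema_name="sales",
    table_name="orders",
    is_read=True,
)
print(table)
```

Keeping both flags as plain booleans lets a single record describe a table that is read, written, or both.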
            yield table_node.table

    @abstractmethod
    def collect_tables_from_source(self, source_code: str, inherited_tree: Tree | None) -> Iterable[TableInfoNode]: ...
this method is not used in this abstract class
removed
    logger = logging.getLogger(__name__)


    class TableInfoCrawler(CrawlerBase[TableInfo]):
Suggested change:

    - class TableInfoCrawler(CrawlerBase[TableInfo]):
    + class TableUsageCrawler(CrawlerBase[TableInfo]):

also, it looks very much similar to `DirectFsAccessCrawler`, just a different class. @asnare how can we make it more generic?
We might want to wait for the MultiCrawler to answer that question?
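To make the question concrete, one possible direction is a crawler parameterized by its record type, so that the direct filesystem access and used-table crawlers differ only in the class they store. This is purely an illustrative sketch and not what the maintainers decided (the reply above defers to the MultiCrawler work). `InMemoryCrawler`, `dump_all`, `snapshot` and the simplified record classes are hypothetical names, and an in-memory list stands in for the SQL backend.

```python
from dataclasses import dataclass
from typing import Generic, Iterable, TypeVar

T = TypeVar("T")


class InMemoryCrawler(Generic[T]):
    """Hypothetical generic crawler: stores and returns records of one type."""

    def __init__(self, record_type: type[T]):
        self._record_type = record_type
        self._rows: list[T] = []

    def dump_all(self, records: Iterable[T]) -> None:
        # Reject records of the wrong type so each crawler stays homogeneous.
        for record in records:
            if not isinstance(record, self._record_type):
                raise TypeError(f"expected {self._record_type.__name__}")
            self._rows.append(record)

    def snapshot(self) -> list[T]:
        # Return a copy so callers cannot mutate the stored rows.
        return list(self._rows)


@dataclass
class DirectFsAccess:  # simplified stand-in record
    path: str


@dataclass
class UsedTable:  # simplified stand-in record
    table_name: str


# The two crawlers share all logic and differ only in the record type.
dfsa_crawler = InMemoryCrawler(DirectFsAccess)
tables_crawler = InMemoryCrawler(UsedTable)
dfsa_crawler.dump_all([DirectFsAccess("dbfs:/mnt/data")])
tables_crawler.dump_all([UsedTable("orders")])
print(dfsa_crawler.snapshot(), tables_crawler.snapshot())
```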
…x into increase-test-coverage
    return UsedTable(
        catalog_name=catalog_name,
        schema_name=src_schema,
        table_name=table.name,
        is_read=isinstance(self._expression, Select),
        is_write=isinstance(self._expression, (Create, Update, Delete)),
    )
Suggested change:

    - return UsedTable(
    -     catalog_name=catalog_name,
    -     schema_name=src_schema,
    -     table_name=table.name,
    -     is_read=isinstance(self._expression, Select),
    -     is_write=isinstance(self._expression, (Create, Update, Delete)),
    - )
    + is_read = isinstance(self._expression, Select)
    + return UsedTable(
    +     catalog_name=catalog_name,
    +     schema_name=src_schema,
    +     table_name=table.name,
    +     is_read=is_read,
    +     is_write=not is_read,
    + )

i don't know if sqlglot parses MERGE statement as UPDATE
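The two variants differ for any statement that is neither a `Select` nor one of the listed write expressions. The sketch below compares them using placeholder classes rather than sqlglot's real expression classes. The `Merge` placeholder in particular is an assumption made for illustration and says nothing about how sqlglot actually parses MERGE.

```python
class Expression:
    """Placeholder base class standing in for a parsed SQL expression."""


class Select(Expression): ...
class Create(Expression): ...
class Update(Expression): ...
class Delete(Expression): ...
class Merge(Expression): ...  # hypothetical: a statement kind missing from the tuple


def flags_explicit(expression: Expression) -> tuple[bool, bool]:
    # Original variant: a write is one of an explicit set of expression types.
    is_read = isinstance(expression, Select)
    is_write = isinstance(expression, (Create, Update, Delete))
    return is_read, is_write


def flags_suggested(expression: Expression) -> tuple[bool, bool]:
    # Suggested variant: everything that is not a read counts as a write.
    is_read = isinstance(expression, Select)
    return is_read, not is_read


for statement in (Select(), Update(), Merge()):
    name = type(statement).__name__
    print(name, flags_explicit(statement), flags_suggested(statement))
```

With the explicit tuple, an unlisted statement type ends up as neither read nor write. With `not is_read`, it is always reported as a write, which avoids the gap but also flags every other non-`Select` statement as a write.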
    def __init__(self, expression: Expression):
        self._expression = expression

    def collect_table_infos(self, required_catalog: str, session_state: CurrentSessionState) -> Iterable[UsedTable]:
Suggested change:

    - def collect_table_infos(self, required_catalog: str, session_state: CurrentSessionState) -> Iterable[UsedTable]:
    + def collect_used_tables(self, required_catalog: str, session_state: CurrentSessionState) -> Iterable[UsedTable]:

    def __init__(self, backend: SqlBackend, schema: str, table: str):
        """
        Initializes a DFSACrawler instance.
remove or update copy-pasted comments
    directfs_crawler = create_autospec(DirectFsAccessCrawler)
    used_tables_crawler = create_autospec(UsedTablesCrawler)
    linter = WorkflowLinter(
        ws, dependency_resolver, mock_path_lookup, empty_index, directfs_crawler, used_tables_crawler
Suggested change:

    - ws, dependency_resolver, mock_path_lookup, empty_index, directfs_crawler, used_tables_crawler
    + ws, dependency_resolver, mock_path_lookup, empty_index, directfs_crawler, used_tables_crawler,

nit: make fmt
lgtm. please address remaining comments in a separate pr
## Changes
Add type hints for cached properties and resolve the linting issues found after adding the type hints. Would have caught the integration test issues introduced in #2772

---------

Co-authored-by: Serge Smertin <serge.smertin@databricks.com>
* Added `Farama-Notifications` to known list ([#2822](#2822)). A new configuration has been implemented in this release to integrate Farama-Notifications into the existing system, partially addressing issue [#193](#193).
* Added `aiohttp-cors` library to known list ([#2775](#2775)). In this release, we have added the `aiohttp-cors` library to our project, providing asynchronous Cross-Origin Resource Sharing (CORS) handling for the `aiohttp` library. This addition enhances the robustness and flexibility of CORS management in our relevant projects. The library includes several new modules such as "aiohttp_cors", "aiohttp_cors.abc", "aiohttp_cors.cors_config", "aiohttp_cors.mixin", "aiohttp_cors.preflight_handler", "aiohttp_cors.resource_options", and "aiohttp_cors.urldispatcher_router_adapter", which offer functionalities for configuring and handling CORS in `aiohttp` applications. This change partially resolves issue [#1931](#1931) and further strengthens our application's security and cross-origin resource sharing capabilities.
* Added `category-encoders` library to known list ([#2781](#2781)). In this release, we've added the `category-encoders` library to our supported libraries, which provides a variety of methods for encoding categorical variables as numerical data, including one-hot encoding and target encoding. This addition resolves part of issue [#1931](#1931), which concerned the support of this library. The library has been integrated into our system by adding a new entry for `category-encoders` in the known.json file, which contains several modules and classes corresponding to various encoding methods provided by the library. This enhancement enables software engineers to leverage the capabilities of `category-encoders` library to encode categorical variables more efficiently and effectively.
* Added `cmdstanpy` to known list ([#2786](#2786)). In this release, we have added `cmdstanpy` and `stanio` libraries to our codebase.
  `cmdstanpy` is a Python library for interfacing with the Stan probabilistic programming language and has been added to the whitelist. This addition enables the use of `cmdstanpy`'s functionalities, including loading, inspecting, and manipulating Stan model objects, as well as running MCMC simulations. Additionally, we have included the `stanio` library, which provides functionality for reading and writing Stan data and model files. These additions enhance the codebase's capabilities for working with probabilistic models, offering expanded options for loading, manipulating, and simulating models written in Stan.
* Added `confection` library to known list ([#2787](#2787)). In this release, the `confection` library, a lightweight configuration system for Python, has been added to the known list of libraries and is now usable within the project. Additionally, several modules from the `srsly` library, a collection of serialization utilities for Python including support for JSON, MessagePack, cloudpickle, and Ruamel YAML, have been added to the known list of libraries, increasing the project's flexibility and functionality in handling serialized data. This partially resolves issue [#1931](#1931).
* Added `configparser` library to known list ([#2796](#2796)). In this release, we have added support for the `configparser` library, addressing issue [#1931](#1931). `configparser` is a standard Python library used for parsing configuration files. This change not only whitelists the library but also includes the "backports.configparser" and "backports.configparser.compat" modules, providing backward compatibility for older versions of Python. By recognizing and supporting the `configparser` library, users can now utilize it in their code with confidence, knowing that it is a known and supported library.
  This update also ensures that the backports for older Python versions are recognized, enabling users to leverage the library seamlessly, regardless of the Python version they are using.
* Added `diskcache` library to known list ([#2790](#2790)). A new update has been made to include the `diskcache` library in our open-source library's known list, as detailed in the release notes. This addition brings in multiple modules, including `diskcache`, `diskcache.cli`, `diskcache.core`, `diskcache.djangocache`, `diskcache.persistent`, and `diskcache.recipes`. The `diskcache` library is a high-performance caching system, useful for a variety of purposes such as caching database queries, API responses, or any large data that needs frequent access. By adding the `diskcache` library to the known list, developers can now leverage its capabilities in their projects, partially addressing issue [#1931](#1931).
* Added `dm-tree` library to known list ([#2789](#2789)). In this release, we have added the `dm-tree` library to our project's known list, enabling its integration and use within our software. The `dm-tree` library provides utilities for working with nested data structures. This addition partially resolves issue [#1931](#1931).
* Added `evaluate` to known list ([#2821](#2821)). In this release, we have added the `evaluate` package and its dependent libraries to our open-source library. The `evaluate` package is a tool for evaluating and analyzing machine learning models, providing a consistent interface to various evaluation tasks.
  Its dependent libraries include `colorful`, `cmdstanpy`, `comm`, `eradicate`, `multiprocess`, and `xxhash`. The `colorful` library is used for colorizing terminal output, while `cmdstanpy` provides Python infrastructure for Stan, a platform for statistical modeling and high-performance statistical computation. The `comm` library is used for creating and managing IPython comms, and `eradicate` is used for removing commented-out code from Python files. The `multiprocess` library is used for spawning processes, and `xxhash` is used for the XXHash algorithms, which are used for fast hash computation. This addition partly resolves issue [#1931](#1931), providing enhanced functionality for evaluating machine learning models.
* Added `future` to known list ([#2823](#2823)). In this commit, we have added the `future` module, a compatibility layer for Python 2 and Python 3, to the project's known list in the configuration file. This module provides a wide range of backward-compatible tools and fixers to smooth over the differences between the two major versions of Python. It includes numerous sub-modules such as "future.backports", "future.builtins", "future.moves", and "future.standard_library", among others, which offer backward-compatible features for various parts of the Python standard library. The commit also includes related modules like "libfuturize", "libpasteurize", and `past` and their respective sub-modules, which provide tools for automatically converting Python 2 code to Python 3 syntax. These additions enhance the project's compatibility with both Python 2 and Python 3, providing developers with an easier way to write cross-compatible code. By adding the `future` module and related tools, the project can take full advantage of the features and capabilities provided, simplifying the process of writing code that works on both versions of the language.
* Added `google-api-core` to known list ([#2824](#2824)).
  In this commit, we have added the `google-api-core` and `proto-plus` packages to our codebase. The `google-api-core` package brings in a collection of modules for low-level support of Google Cloud services, such as client options, gRPC helpers, and retry mechanisms. This addition enables access to a wide range of functionalities for interacting with Google Cloud services. The `proto-plus` package includes protobuf-related modules, simplifying the handling and manipulation of protobuf messages. This package includes datetime helpers, enums, fields, marshaling utilities, message definitions, and more. These changes enhance the project's versatility, providing users with a more feature-rich environment for interacting with external services, such as those provided by Google Cloud. Users will benefit from the added functionality and convenience provided by these packages.
* Added `google-auth-oauthlib` and dependent libraries to known list ([#2825](#2825)). In this release, we have added the `google-auth-oauthlib` and `requests-oauthlib` libraries and their dependencies to our repository to enhance OAuth2 authentication flow support. The `google-auth-oauthlib` library is utilized for Google's OAuth2 client authentication and authorization flows, while `requests-oauthlib` provides OAuth1 and OAuth2 support for the `requests` library. This change partially resolves the missing dependencies issue and improves the project's ability to handle OAuth2 authentication flows with Google and other providers.
* Added `greenlet` to known list ([#2830](#2830)). In this release, we have added the `greenlet` library to the known list in the configuration file, addressing part of issue [#193](#193).
* Added `gymnasium` to known list ([#2832](#2832)). A new update has been made to include the popular open-source `gymnasium` library in the project's configuration file.
  The library provides various environments, spaces, and wrappers for developing and testing reinforcement learning algorithms, and includes modules such as "gymnasium.core", "gymnasium.envs", "gymnasium.envs.box2d", "gymnasium.envs.classic_control", "gymnasium.envs.mujoco", "gymnasium.envs.phys2d", "gymnasium.envs.registration", "gymnasium.envs.tabular", "gymnasium.envs.toy_text", "gymnasium.experimental", "gymnasium.logger", "gymnasium.spaces", and "gymnasium.utils", each with specific functionality. This addition enables developers to utilize the library without having to modify any existing code and take advantage of the latest features and bug fixes. This change partly addresses issue [#1931](#1931), allowing developers to use `gymnasium` for developing and testing reinforcement learning algorithms.
* Added and populate UCX `workflow_runs` table ([#2754](#2754)). In this release, we have added and populated a new `workflow_runs` table in the UCX project to track the status of workflow runs and handle concurrent writes. This update resolves issue [#2600](#2600) and is accompanied by modifications to the `migration-process-experimental` workflow, new `WorkflowRunRecorder` and `ProgressTrackingInstallation` classes, and updated user documentation. We have also added unit tests, integration tests, and a `record_workflow_run` method in the `MigrationWorkflow` class. The new table and methods have been tested to ensure they correctly record workflow run information. However, there are still some issues to address, such as deciding on getting workflow run status from `parse_log_task`.
* Added collection of used tables from Python notebooks and files and SQL queries ([#2772](#2772)). This commit introduces the collection and storage of table usage information as part of linting jobs to enable tracking of legacy table usage and lineage.
  The changes include the modification of existing workflows, addition of new tables and views, and the introduction of new classes such as `UsedTablesCrawler`, `LineageAtom`, and `TableInfoNode`. The new classes and methods support tracking table usage and lineage in Python notebooks, files, and SQL queries. Unit tests and integration tests have been added and updated to ensure the correct functioning of this feature. This is the first pull request in a series of three, with the next two focusing on using the table information in queries and displaying results in the assessment dashboard.
* Changed logic of direct filesystem access linting ([#2766](#2766)). This commit modifies the direct filesystem access (DFSA) linting logic to reduce false positives and improve precision. Previously, all string constants matching a DFSA pattern were detected, with false positives filtered on a case-by-case basis. The new approach narrows DFSA detection to instances originating from `spark` or `dbutils` modules, ensuring relevance and minimizing false alarms. The commit introduces new methods, such as `is_builtin()` and `get_call_name()`, to determine if a given node is a built-in or not. Additionally, it includes unit tests and updates to the test cases in `test_directfs.py` to reflect the new detection criteria. This change enhances the linting process and enables developers to maintain better control over direct filesystem access within the `spark` and `dbutils` modules.
* Fixed integration issue when collecting tables ([#2817](#2817)). In this release, we have addressed integration issues related to table collection in the Databricks Labs UCX project. We have introduced a new `UsedTablesCrawler` class to crawl tables in paths and queries, which resolves issues reported in tickets [#2800](#2800) and [#2808](#2808).
  Additionally, we have updated the `directfs_access_crawler_for_paths` and `directfs_access_crawler_for_queries` methods to work with the new `UsedTablesCrawler` class. We have also made changes to the `workflow_linter` method to include the new `used_tables_crawler_for_paths` property. Furthermore, we have refactored the `lint` method of certain classes to a `collect_tables` method, which returns an iterable of `UsedTable` objects to improve table collection. The `lint` method now processes the collected tables and raises advisories as needed, while the `apply` method remains unchanged. Integration tests were executed as part of this commit.
* Increase test coverage ([#2818](#2818)). In this update, we have expanded the test suite for the `Tree` class in our Python AST codebase with several new unit tests. These tests are designed to verify various behaviors, including checking for `None` returns, validating string truncation, ensuring `NotImplementedError` exceptions are raised during node appending and method calls, and testing the correct handling of global variables. Additionally, we have included tests that ensure a constant is not from a specific module. This enhancement signifies our dedication to improving test coverage and consistency, which will aid in maintaining code quality, detecting unintended side effects, and preventing regressions in future development efforts.
* Strip preliminary comments in pip cells ([#2763](#2763)). In this release, we have addressed an issue in the processing of pip commands preceded by non-MAGIC comments, ensuring that pip-based library management in Databricks notebooks functions correctly. The changes include stripping preliminary comments and handling the case where the pip command is preceded by a single '%' or '!'. Additionally, a new unit test has been added to validate the behavior of a notebook containing a malformed pip cell.
  This test checks that the notebook can still be parsed and built into a dependency graph without issues, even in the presence of non-MAGIC comments preceding the pip install command. The code for the test is written in Python and uses the `Notebook`, `Dependency`, and `DependencyGraph` classes to parse the notebook and build the dependency graph. The overall functionality of the code remains unchanged, and the code now correctly processes pip commands in the presence of non-MAGIC comments.
* Temporarily ignore `MANAGED` HMS tables on external storage location ([#2837](#2837)). This release introduces changes to the behavior of the `_migrate_external_table` method in the `table_migrate.py` file, specifically for handling managed tables located on external storage. Previously, the method attempted to migrate any external table, but with this change, it now checks if the object type is 'MANAGED'. If it is, a warning message is logged, and the migration is skipped due to UCX's lack of support for migrating managed tables on external storage. This change affects the existing workflow, specifically the behavior of the `migrate_dbfs_root_tables` function in the HMS table migration test suite. The function now checks for the absence of certain SQL queries, specifically those involving `SYNC TABLE` and `ALTER TABLE`, in the `backend.queries` list to ensure that queries related to managed tables on external storage locations are excluded. This release includes unit tests and integration tests to verify the changes and ensure proper behavior for the modified workflow. Issue [#2838](#2838) has been resolved with this commit.
* Updated sqlglot requirement from <25.23,>=25.5.0 to >=25.5.0,<25.25 ([#2765](#2765)). In this release, we have updated the sqlglot requirement in the pyproject.toml file to allow for any version greater than or equal to 25.5.0 but less than 25.25. This resolves a conflict in the previous requirement, which ranged from >=25.5.0 to <25.23.
  The update includes several bug fixes, refactors, and new features, such as support for the OVERLAY function in PostgreSQL and a flag to automatically exclude Keep diff nodes. Additionally, the check_deploy job has been simplified, and the supported dialect count has increased from 21 to 23. This update ensures that the project remains up-to-date and compatible with the latest version of sqlglot, while also improving functionality and stability.
* Whitelists catalogue library ([#2780](#2780)). In this release, we've whitelisted the `catalogue` library by adding it to the known list, which partially addresses issue [#193](#193).
* Whitelists circuitbreaker ([#2783](#2783)). The `circuitbreaker` library, a Python implementation of the circuit breaker pattern, has been whitelisted. A new entry for `circuitbreaker` is added to the `known.json` configuration file, containing an empty list. This development partially addresses issue [#1931](#1931).
* Whitelists cloudpathlib ([#2784](#2784)).
  In this release, we have whitelisted the cloudpathlib library by adding it to the known.json file. Cloudpathlib is a Python library for manipulating cloud paths, and includes several modules for interacting with various cloud storage systems. Each module has been added to the known.json file with an empty list, indicating that no critical issues have been found in these modules. However, we have added warnings for the use of direct filesystem references in specific classes and methods within the cloudpathlib.azure.azblobclient, cloudpathlib.azure.azblobpath, cloudpathlib.cloudpath, cloudpathlib.gs.gsclient, cloudpathlib.gs.gspath, cloudpathlib.local.implementations.azure, cloudpathlib.local.implementations.gs, cloudpathlib.local.implementations.s3, cloudpathlib.s3.s3client, and cloudpathlib.s3.s3path modules. The warning message indicates that the use of direct filesystem references is deprecated and will be removed in a future release. This change addresses a portion of issue [#1931](#1931).
* Whitelists colorful ([#2785](#2785)). In this release, we have added support for the `colorful` library, a Python package for generating ANSI escape codes to colorize terminal output. The library contains several modules, including "ansi", "colors", "core", "styles", "terminal", and "utils", all of which have been whitelisted and added to the "known.json" file. This change resolves issue [#1931](#1931) and broadens the range of approved libraries that can be used in the project, enabling more flexible and visually appealing terminal output.
* Whitelists cymem ([#2793](#2793)). In this release, we have made changes to the known.json file to whitelist the use of the cymem package in our project. This new entry includes sub-entries such as "cymem", "cymem.about", "cymem.tests", and "cymem.tests.test_import", which likely correspond to specific components or aspects of the package that require whitelisting.
  This change partially addresses issue [#1931](#1931). It is important to note that this commit does not modify any existing functionality or add any new methods; rather, it simply grants permission for the cymem package to be used in our project.
* Whitelists dacite ([#2795](#2795)). In this release, we have whitelisted the dacite library in our known.json file. Dacite is a library that simplifies creating Python dataclasses from dictionaries. By whitelisting dacite, users of our project can now utilize this library in their code without encountering any compatibility issues. This change partially addresses issue [#1931](#1931).
* Whitelists databricks-automl-runtime ([#2794](#2794)). A new change has been implemented to whitelist the `databricks-automl-runtime` in the "known.json" file, enabling several nested packages and modules related to Databricks' auto ML runtime for forecasting and hyperparameter tuning. The newly added modules provide functionalities for data preprocessing and model training, including handling time series data, missing values, and one-hot encoding. This modification addresses a portion of issue [#1931](#1931), improving the library's compatibility with Databricks' auto ML runtime.
* Whitelists dataclasses-json ([#2792](#2792)). A new configuration has been added to the "known.json" file, whitelisting the `dataclasses-json` library, which provides serialization and deserialization functionality to Python dataclasses. This change partially resolves issue [#1931](#1931) and introduces new methods for serialization and deserialization through this library.
  Additionally, the libraries `marshmallow` and its associated modules, as well as "typing-inspect," have also been whitelisted, adding further serialization and deserialization capabilities. It's important to note that these changes do not affect existing functionality, but instead provide new options for handling these data structures.
* Whitelists dbl-tempo ([#2791](#2791)). A new library, dbl-tempo, has been whitelisted and is now approved for use in the project. This library provides time series utilities, including interpolation, intervals, resampling, and utility methods. Its modules have been added to the known.json file, indicating that they are now recognized and approved for use. This change addresses part of issue [#1931](#1931).
* whitelist blis ([#2776](#2776)). In this release, we have added the high-performance computing library `blis` to our whitelist, partially addressing issue [#1931](#1931). The blis library is optimized for various CPU architectures and provides dense linear algebra capabilities, which can improve the performance of workloads that utilize these operations. With this change, the blis library and its components have been included in our system's whitelist, enabling users to leverage its capabilities.
* whitelists brotli ([#2777](#2777)). In this release, we have partially addressed issue [#1931](#1931) by adding support for the Brotli data compression algorithm in our project.
  The Brotli JSON object and an empty array for `brotli` have been added to the "known.json" configuration file to recognize and support its use. This change does not modify any existing functionality or introduce new methods, but rather whitelists Brotli as a supported algorithm for future use in the project. This enhancement allows for more flexibility and options when working with data compression, providing software engineers with an additional tool for optimization and performance improvements.

Dependency updates:

* Updated sqlglot requirement from <25.23,>=25.5.0 to >=25.5.0,<25.25 ([#2765](#2765)).
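
The pip cell entry above ([#2763](#2763)) describes stripping comments that precede a pip command. A minimal sketch of that idea follows. It is not the UCX implementation: the function name `strip_preliminary_comments` is hypothetical, and it assumes a "non-MAGIC comment" is any line starting with `#` that does not start with `# MAGIC`.

```python
def strip_preliminary_comments(cell_source: str) -> str:
    """Drop leading blank lines and plain '#' comments from a cell.

    Assumption: lines starting with '# MAGIC' are notebook magic lines and
    must be kept; any other leading '#' line is an ordinary comment.
    """
    lines = cell_source.splitlines()
    start = 0
    for line in lines:
        text = line.strip()
        is_plain_comment = text.startswith("#") and not text.startswith("# MAGIC")
        if text and not is_plain_comment:
            break  # first meaningful line, e.g. '%pip install ...'
        start += 1
    return "\n".join(lines[start:])


cell = "# install dependencies\n# pinned for reproducibility\n%pip install foo"
print(strip_preliminary_comments(cell))
```

Only leading comments are removed, so comments that follow the pip command are left in place.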
* Added `Farama-Notifications` to known list ([#2822](#2822)). This release adds a configuration entry for `Farama-Notifications` to the known list, partially addressing issue [#193](#193).
* Added `aiohttp-cors` library to known list ([#2775](#2775)). This release adds `aiohttp-cors`, which provides Cross-Origin Resource Sharing (CORS) handling for `aiohttp` applications, to the known list. The entry covers the modules `aiohttp_cors`, `aiohttp_cors.abc`, `aiohttp_cors.cors_config`, `aiohttp_cors.mixin`, `aiohttp_cors.preflight_handler`, `aiohttp_cors.resource_options`, and `aiohttp_cors.urldispatcher_router_adapter`. This change partially resolves issue [#1931](#1931).
* Added `category-encoders` library to known list ([#2781](#2781)). This release adds `category-encoders`, which provides methods for encoding categorical variables as numerical data (including one-hot encoding and target encoding), to the known list. The new entry in `known.json` lists the modules and classes that correspond to the library's encoding methods. This change resolves part of issue [#1931](#1931).
* Added `cmdstanpy` to known list ([#2786](#2786)). In this release, we have added the `cmdstanpy` and `stanio` libraries to the known list.
`cmdstanpy` is a Python library for interfacing with the Stan probabilistic programming language; it supports loading, inspecting, and manipulating Stan model objects, as well as running MCMC simulations. The `stanio` library provides functionality for reading and writing Stan data and model files. Both have been added to the whitelist.
* Added `confection` library to known list ([#2787](#2787)). In this release, `confection`, a lightweight Python library for configuration management, has been added to the known list of libraries. Additionally, several modules from the `srsly` library, a collection of serialization utilities for Python including support for JSON, MessagePack, cloudpickle, and Ruamel YAML, have been added to the known list. This partially resolves issue [#1931](#1931).
* Added `configparser` library to known list ([#2796](#2796)). In this release, we have added support for the `configparser` library, addressing issue [#1931](#1931). `configparser` is the Python module for parsing configuration files. This change also whitelists the `backports.configparser` and `backports.configparser.compat` modules, which provide backward compatibility for older versions of Python. Users can now use `configparser` in their code knowing that it is a known and supported library.
This update also ensures that the backports for older Python versions are recognized, regardless of the Python version in use.
* Added `diskcache` library to known list ([#2790](#2790)). This release adds the `diskcache` library to the known list. The entry covers the modules `diskcache`, `diskcache.cli`, `diskcache.core`, `diskcache.djangocache`, `diskcache.persistent`, and `diskcache.recipes`. `diskcache` is a disk-backed caching library, useful for caching database queries, API responses, or any large data that needs frequent access. This change partially addresses issue [#1931](#1931).
* Added `dm-tree` library to known list ([#2789](#2789)). In this release, we have added the `dm-tree` library to the known list. `dm-tree` is a library for working with nested (tree-like) data structures, backed by a C++ implementation, with support for sequences and tree benchmarking. This change partially resolves issue [#1931](#1931).
* Added `evaluate` to known list ([#2821](#2821)). In this release, we have added the `evaluate` package and its dependent libraries to the known list. The `evaluate` package is a tool for evaluating and analyzing machine learning models, providing a consistent interface to various evaluation tasks.
Its dependent libraries include `colorful`, `cmdstanpy`, `comm`, `eradicate`, `multiprocess`, and `xxhash`. The `colorful` library colorizes terminal output, while `cmdstanpy` provides a Python interface to Stan, a platform for statistical modeling and high-performance statistical computation. The `comm` library is used for creating and managing IPython comms, and `eradicate` removes commented-out code from Python files. The `multiprocess` library is used for spawning processes, and `xxhash` provides the xxHash algorithms for fast hash computation. This addition partly resolves issue [#1931](#1931).
* Added `future` to known list ([#2823](#2823)). In this commit, we have added the `future` module, a compatibility layer for Python 2 and Python 3, to the known list in the configuration file. This module provides backward-compatible tools and fixers that smooth over the differences between the two major versions of Python. It includes sub-modules such as `future.backports`, `future.builtins`, `future.moves`, and `future.standard_library`, among others, which offer backward-compatible features for various parts of the Python standard library. The commit also includes the related modules `libfuturize`, `libpasteurize`, and `past`, with their respective sub-modules, which provide tools for automatically converting code so that it runs on both Python 2 and Python 3.
* Added `google-api-core` to known list ([#2824](#2824)).
In this commit, we have added the `google-api-core` and `proto-plus` packages to the known list. The `google-api-core` package is a collection of modules for low-level support of Google Cloud services, such as client options, gRPC helpers, and retry mechanisms. The `proto-plus` package contains protobuf-related modules that simplify the handling and manipulation of protobuf messages, including datetime helpers, enums, fields, marshaling utilities, and message definitions.
* Added `google-auth-oauthlib` and dependent libraries to known list ([#2825](#2825)). In this release, we have added the `google-auth-oauthlib` and `requests-oauthlib` libraries and their dependencies to the known list. The `google-auth-oauthlib` library is used for Google's OAuth2 client authentication and authorization flows, while `requests-oauthlib` provides OAuth1 and OAuth2 support for the `requests` library. This change partially resolves the missing dependencies issue.
* Added `greenlet` to known list ([#2830](#2830)). In this release, we have added the `greenlet` library to the known list in the configuration file, addressing part of issue [#193](#193).
* Added `gymnasium` to known list ([#2832](#2832)). This release adds the open-source `gymnasium` library to the known list in the project's configuration file.
The library provides environments, spaces, and wrappers for developing and testing reinforcement learning algorithms, and includes modules such as `gymnasium.core`, `gymnasium.envs`, `gymnasium.envs.box2d`, `gymnasium.envs.classic_control`, `gymnasium.envs.mujoco`, `gymnasium.envs.phys2d`, `gymnasium.envs.registration`, `gymnasium.envs.tabular`, `gymnasium.envs.toy_text`, `gymnasium.experimental`, `gymnasium.logger`, `gymnasium.spaces`, and `gymnasium.utils`. With this addition, developers can use `gymnasium` without modifying any existing code. This change partly addresses issue [#1931](#1931).
* Added and populate UCX `workflow_runs` table ([#2754](#2754)). In this release, we have added and populated a new `workflow_runs` table in the UCX project to track the status of workflow runs and handle concurrent writes. This update resolves issue [#2600](#2600) and is accompanied by modifications to the `migration-process-experimental` workflow, new `WorkflowRunRecorder` and `ProgressTrackingInstallation` classes, and updated user documentation. We have also added unit tests, integration tests, and a `record_workflow_run` method in the `MigrationWorkflow` class. The new table and methods have been tested to ensure they correctly record workflow run information. One open question remains: whether to get the workflow run status from `parse_log_task`.
* Added collection of used tables from Python notebooks and files and SQL queries ([#2772](#2772)). This commit introduces the collection and storage of table usage information as part of linting jobs to enable tracking of legacy table usage and lineage.
The changes include modified workflows, new tables and views, and new classes such as `UsedTablesCrawler`, `LineageAtom`, and `TableInfoNode`. The new classes and methods support tracking table usage and lineage in Python notebooks, files, and SQL queries. Unit tests and integration tests have been added and updated to cover this feature. This is the first pull request in a series of three; the next two focus on using the table information in queries and on displaying results in the assessment dashboard.
* Changed logic of direct filesystem access linting ([#2766](#2766)). This commit modifies the direct filesystem access (DFSA) linting logic to reduce false positives and improve precision. Previously, all string constants matching a DFSA pattern were detected, with false positives filtered on a case-by-case basis. The new approach narrows DFSA detection to instances originating from the `spark` or `dbutils` modules. The commit introduces new methods, such as `is_builtin()` and `get_call_name()`, to determine whether a given node is a built-in. It also includes unit tests and updates the test cases in `test_directfs.py` to reflect the new detection criteria.
* Fixed integration issue when collecting tables ([#2817](#2817)). In this release, we have addressed integration issues related to table collection in the Databricks Labs UCX project. We have introduced a new `UsedTablesCrawler` class to crawl tables in paths and queries, which resolves the issues reported in tickets [#2800](#2800) and [#2808](#2808).
Additionally, we have updated the `directfs_access_crawler_for_paths` and `directfs_access_crawler_for_queries` methods to work with the new `UsedTablesCrawler` class, and changed the `workflow_linter` method to include the new `used_tables_crawler_for_paths` property. Furthermore, we have refactored the `lint` method of certain classes into a `collect_tables` method, which returns an iterable of `UsedTable` objects. The `lint` method now processes the collected tables and raises advisories as needed, while the `apply` method remains unchanged. Integration tests were executed as part of this commit.
* Increase test coverage ([#2818](#2818)). In this update, we have expanded the test suite for the `Tree` class in our Python AST codebase with several new unit tests. These tests verify various behaviors, including `None` returns, string truncation, `NotImplementedError` exceptions raised during node appending and method calls, and the correct handling of global variables. Additional tests ensure that a constant is not from a specific module. The added coverage helps detect unintended side effects and prevent regressions.
* Strip preliminary comments in pip cells ([#2763](#2763)). In this release, we have fixed the processing of pip commands preceded by non-MAGIC comments, so that pip-based library management in Databricks notebooks is handled correctly. The changes include stripping preliminary comments and handling the case where the pip command is preceded by a single `%` or `!`. Additionally, a new unit test has been added to validate the behavior of a notebook containing a malformed pip cell.
This test checks that the notebook can still be parsed and built into a dependency graph, even when non-MAGIC comments precede the pip install command. The test uses the `Notebook`, `Dependency`, and `DependencyGraph` classes to parse the notebook and build the dependency graph. The overall functionality of the code remains unchanged.
* Temporarily ignore `MANAGED` HMS tables on external storage location ([#2837](#2837)). This release changes the behavior of the `_migrate_external_table` method in the `table_migrate.py` file for managed tables located on external storage. Previously, the method attempted to migrate any external table; it now checks whether the object type is `MANAGED`. If it is, a warning message is logged and the migration is skipped, because UCX does not support migrating managed tables on external storage. This change affects the existing workflow, specifically the behavior of the `migrate_dbfs_root_tables` function in the HMS table migration test suite. The function now checks that certain SQL queries, specifically those involving `SYNC TABLE` and `ALTER TABLE`, are absent from the `backend.queries` list, to ensure that queries related to managed tables on external storage locations are excluded. This release includes unit tests and integration tests to verify the modified workflow. Issue [#2838](#2838) has been resolved with this commit.
* Updated sqlglot requirement from <25.23,>=25.5.0 to >=25.5.0,<25.25 ([#2765](#2765)). In this release, we have updated the sqlglot requirement in the `pyproject.toml` file to allow any version greater than or equal to 25.5.0 but less than 25.25. This resolves a conflict in the previous requirement, which ranged from >=25.5.0 to <25.23.
The update includes several bug fixes, refactors, and new features, such as support for the OVERLAY function in PostgreSQL and a flag to automatically exclude Keep diff nodes. Additionally, the check_deploy job has been simplified, and the supported dialect count has increased from 21 to 23. This update keeps the project compatible with the latest version of sqlglot.
* Whitelists catalogue library ([#2780](#2780)). In this release, we have whitelisted the `catalogue` library, which partially addresses issue [#193](#193). Code that imports `catalogue` is now recognized as using a known library.
* Whitelists circuitbreaker ([#2783](#2783)). This release whitelists `circuitbreaker`, a library implementing the circuit breaker pattern, which improves fault tolerance and prevents cascading failures by introducing a delay before retrying requests to a failed service. A new entry for `circuitbreaker`, containing an empty list, is added to the `known.json` configuration file. This change partially addresses issue [#1931](#1931).
* Whitelists cloudpathlib ([#2784](#2784)).
In this release, we have whitelisted the `cloudpathlib` library by adding it to the `known.json` file. `cloudpathlib` is a Python library for manipulating cloud paths, and includes several modules for interacting with various cloud storage systems. Most modules have been added to `known.json` with an empty list, indicating that no critical issues have been found in them. However, we have added warnings for the use of direct filesystem references in specific classes and methods within the `cloudpathlib.azure.azblobclient`, `cloudpathlib.azure.azblobpath`, `cloudpathlib.cloudpath`, `cloudpathlib.gs.gsclient`, `cloudpathlib.gs.gspath`, `cloudpathlib.local.implementations.azure`, `cloudpathlib.local.implementations.gs`, `cloudpathlib.local.implementations.s3`, `cloudpathlib.s3.s3client`, and `cloudpathlib.s3.s3path` modules. The warning message indicates that the use of direct filesystem references is deprecated and will be removed in a future release. This change addresses a portion of issue [#1931](#1931).
* Whitelists colorful ([#2785](#2785)). In this release, we have added support for the `colorful` library, a Python package for generating ANSI escape codes to colorize terminal output. The library contains several modules, including `ansi`, `colors`, `core`, `styles`, `terminal`, and `utils`, all of which have been whitelisted and added to the `known.json` file. This change resolves issue [#1931](#1931) and broadens the range of approved libraries that can be used in the project.
* Whitelists cymem ([#2793](#2793)). In this release, we have changed the `known.json` file to whitelist the `cymem` package. The new entry includes the sub-entries `cymem`, `cymem.about`, `cymem.tests`, and `cymem.tests.test_import`.
This change partially addresses issue [#1931](#1931). This commit does not modify any existing functionality or add any new methods; it only marks the `cymem` package as known.
* Whitelists dacite ([#2795](#2795)). In this release, we have whitelisted the `dacite` library in our `known.json` file. `dacite` simplifies the creation of Python data classes from dictionaries, using type hints to build and validate the objects. By whitelisting `dacite`, users of our project can use this library in their code without it being flagged as unknown. This change partially addresses issue [#1931](#1931).
* Whitelists databricks-automl-runtime ([#2794](#2794)). This release whitelists `databricks-automl-runtime` in the `known.json` file, covering several nested packages and modules related to Databricks' AutoML runtime for forecasting and hyperparameter tuning. The newly added modules provide functionality for data preprocessing and model training, including handling time series data, missing values, and one-hot encoding. This change addresses a portion of issue [#1931](#1931).
* Whitelists dataclasses-json ([#2792](#2792)). A new configuration has been added to the `known.json` file, whitelisting the `dataclasses-json` library, which provides serialization and deserialization functionality for Python dataclasses. This change partially resolves issue [#1931](#1931).
Additionally, `marshmallow` and its associated modules, as well as `typing-inspect`, have been whitelisted. These changes do not affect existing functionality.
* Whitelists dbl-tempo ([#2791](#2791)). A new library, `dbl-tempo`, has been whitelisted. This library provides time-series functionality, including interpolation, intervals, resampling, and utility methods. Its modules have been added to the `known.json` file, indicating that they are now recognized and approved for use. This change addresses part of issue [#1931](#1931).
* whitelist blis ([#2776](#2776)). In this release, we have added the `blis` library to our whitelist, partially addressing issue [#1931](#1931). The `blis` library is optimized for various CPU architectures and provides dense linear algebra routines, which can improve the performance of workloads that rely on these operations. With this change, the `blis` library and its components are included in the whitelist.
* whitelists brotli ([#2777](#2777)). In this release, we have partially addressed issue [#1931](#1931) by adding support for the Brotli data compression library in our project.
A `Brotli` JSON object containing an empty array for `brotli` has been added to the `known.json` configuration file so that its use is recognized. This change does not modify any existing functionality or introduce new methods; it only whitelists Brotli for use in the project.

Dependency updates:

* Updated sqlglot requirement from <25.23,>=25.5.0 to >=25.5.0,<25.25 ([#2765](#2765)).
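The narrowed DFSA detection described in the [#2766](#2766) entry can be illustrated with a minimal sketch. It is not the UCX implementation: it uses the standard-library `ast` module instead of the project's own tree classes, and the names `DFSA_PREFIXES`, `TRUSTED_ROOTS`, `_root_name`, and `find_dfsa`, as well as the prefix list itself, are assumptions made for illustration. A string constant is reported only when it is passed to a call whose chain starts at `spark` or `dbutils`.

```python
from __future__ import annotations

import ast

# Assumed set of path prefixes that count as direct filesystem access in this sketch.
DFSA_PREFIXES = ("dbfs:/", "/dbfs/", "/mnt/", "s3://", "abfss://")
# Only calls rooted at these names are inspected.
TRUSTED_ROOTS = {"spark", "dbutils"}


def _root_name(node: ast.expr) -> str | None:
    """Follow a chain such as spark.read.format(...).load down to its leftmost name."""
    while True:
        if isinstance(node, ast.Attribute):
            node = node.value
        elif isinstance(node, ast.Call):
            node = node.func
        elif isinstance(node, ast.Name):
            return node.id
        else:
            return None


def find_dfsa(source: str) -> list[str]:
    """Return DFSA-looking string arguments of calls rooted at spark or dbutils."""
    found: list[str] = []
    for node in ast.walk(ast.parse(source)):
        if not isinstance(node, ast.Call):
            continue
        if _root_name(node.func) not in TRUSTED_ROOTS:
            continue  # e.g. print("s3://...") is ignored
        arguments = [*node.args, *(kw.value for kw in node.keywords)]
        for arg in arguments:
            if (
                isinstance(arg, ast.Constant)
                and isinstance(arg.value, str)
                and arg.value.startswith(DFSA_PREFIXES)
            ):
                found.append(arg.value)
    return found
```

In this sketch, a bare assignment such as `path = "/mnt/ignored"` or a call such as `print("s3://bucket/x")` is not reported, while `spark.read.parquet("/mnt/data/events")` and `dbutils.fs.ls("dbfs:/tmp")` are. That mirrors the trade-off named in the entry: fewer false positives, at the cost of missing paths that reach `spark` or `dbutils` only through variables.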
Changes
Users need the ability to track legacy table usage and lineage.
This PR collects and stores table usage information as part of linting jobs.
It will be followed by two PRs: one for using this information in queries, and one for displaying results in the assessment dashboard.
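As a rough illustration of the record being collected, the sketch below models a used table with the fields discussed in the review of this PR (`catalog_name`, `schema_name`, `table_name`, `is_read`, `is_write`). It is a simplified stand-in, not the class from the PR: the `UNKNOWN` value, the `parse` helper, the `is_legacy` property, and the defaults for `is_read` and `is_write` are assumptions, and the real class extends `SourceInfo`, which is omitted here.

```python
from __future__ import annotations

from dataclasses import dataclass

UNKNOWN = "unknown"  # assumed stand-in for SourceInfo.UNKNOWN


@dataclass(frozen=True)
class UsedTable:
    catalog_name: str = UNKNOWN
    schema_name: str = UNKNOWN
    table_name: str = UNKNOWN
    is_read: bool = True    # assumed default
    is_write: bool = False  # assumed default

    @classmethod
    def parse(cls, full_name: str, *, is_read: bool = True, is_write: bool = False) -> UsedTable:
        """Hypothetical helper: split 'catalog.schema.table', padding missing parts with UNKNOWN."""
        parts = full_name.split(".")
        if not 1 <= len(parts) <= 3 or not all(parts):
            raise ValueError(f"not a table name: {full_name!r}")
        catalog, schema, table = [UNKNOWN] * (3 - len(parts)) + parts
        return cls(catalog, schema, table, is_read=is_read, is_write=is_write)

    @property
    def is_legacy(self) -> bool:
        """Assumption for this sketch: tables in the hive_metastore catalog are legacy."""
        return self.catalog_name == "hive_metastore"


tables = [
    UsedTable.parse("hive_metastore.sales.orders"),
    UsedTable.parse("main.sales.orders_v2", is_read=False, is_write=True),
]
legacy_names = [t.table_name for t in tables if t.is_legacy]
```

Keeping the record frozen makes it hashable, so repeated references to the same table across notebooks could be de-duplicated with a `set` before being stored.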
Linked issues
None
Functionality
assessment
Tests