52 commits
a55645c
Add WorkspaceTablesLinter for extracting tables used in notebooks
pritishpai Sep 25, 2025
780f309
Add function to get language from path
pritishpai Sep 25, 2025
84bb967
Simple integration test
pritishpai Sep 26, 2025
0fdfddb
used_tables_for_workspace_crawler
pritishpai Sep 26, 2025
eb8332d
Add used_tables_for_workspace crawler
pritishpai Sep 26, 2025
5a6d857
Add scan_workspace_for_tables entrypoint for WorkspaceTablesLinter an…
pritishpai Sep 26, 2025
f333e34
Add _discover_workspace_objects and cleanup for integration tests
pritishpai Sep 26, 2025
579d0fa
Add functions to extract tables from path, parallelize the process pe…
pritishpai Sep 26, 2025
bb02a5e
Logging
pritishpai Sep 26, 2025
e08e983
Add more logging for integration test
pritishpai Sep 26, 2025
c702d74
Fix the WorkspaceObjectInfo constructor call
pritishpai Sep 26, 2025
8451aa2
Add funtions to extract tables from notebooks and files
pritishpai Sep 26, 2025
774f57c
Remove unused imports
pritishpai Sep 26, 2025
7b73283
Upload notebook content properly
pritishpai Sep 26, 2025
ff5f046
Add _extract_tables_from_notebook
pritishpai Sep 26, 2025
8f28f97
Succesfully crawl read tables
pritishpai Sep 29, 2025
8fbc9d5
Add verification of tables crawled are fetched from the inventory table
pritishpai Sep 29, 2025
53059a4
Insert crawled tables in the inventory database
pritishpai Sep 29, 2025
df44d79
Comment
pritishpai Sep 29, 2025
2fde3b1
Use empty TableMigrationIndex
pritishpai Sep 29, 2025
165d6a4
Add a function to detect if it is a dataframe
pritishpai Sep 29, 2025
2ee7405
Add some unit tests
pritishpai Sep 29, 2025
e0113aa
Fmt changes
pritishpai Sep 29, 2025
c304da9
Fmt changes
pritishpai Sep 29, 2025
20449fa
Create common function _get_str_content_from_path
pritishpai Sep 29, 2025
a67ab22
Remove generated unit tests for now
pritishpai Sep 29, 2025
302c743
Fmt
pritishpai Sep 29, 2025
11892b2
Fmt changes and cleanup for integration test
pritishpai Sep 29, 2025
0e07783
Remove unused code
pritishpai Sep 29, 2025
bf2908b
Fmt changes and refactoring
pritishpai Sep 30, 2025
ae2442f
Basic unit testing
pritishpai Sep 30, 2025
3334234
fix unit tests
pritishpai Oct 1, 2025
468a132
Add a new workflow
pritishpai Oct 1, 2025
a07e0ad
Change workflow name
pritishpai Oct 1, 2025
af312ff
Add additional integration test
pritishpai Oct 1, 2025
728584b
Change workflow name
pritishpai Oct 1, 2025
c6527e0
Add workflow to deploy list
pritishpai Oct 1, 2025
3049d4d
Strip '%sql' for proper linting
pritishpai Oct 1, 2025
f4c17d6
Fmt
pritishpai Oct 1, 2025
68694c1
Fmt
pritishpai Oct 1, 2025
c8a30b5
Add a placeholder readme
pritishpai Oct 1, 2025
a4345ef
Refactoring
pritishpai Oct 8, 2025
1bd17dd
Refactor workflow name and code structure
pritishpai Oct 8, 2025
9ae2e2d
Refactor class name
pritishpai Oct 9, 2025
763ec10
Refactor test name
pritishpai Oct 9, 2025
a95ccf0
Add a cli command for workspace code scanner
pritishpai Oct 9, 2025
a5bb05a
Add cli command to labs.yml
pritishpai Oct 9, 2025
fd5266f
fmt changes
pritishpai Oct 9, 2025
0bc5cb0
docs enhance
pritishpai Oct 10, 2025
5e67645
refactoring and adding cli command to docs
pritishpai Oct 10, 2025
6dce482
fmt
pritishpai Oct 10, 2025
9a6dacb
Fix sample query for per workspace area results
pritishpai Oct 13, 2025
1 change: 1 addition & 0 deletions docs/ucx/docs/reference/index.mdx
@@ -20,3 +20,4 @@ It includes the following:
- [Table Upgrade](/docs/reference/table_upgrade)
- [Troubleshooting Guide](/docs/reference/troubleshooting)
- [Workflows](/docs/reference/workflows)
- [Workspace Table Scanning](/docs/reference/workspace-table-scanning)
5 changes: 5 additions & 0 deletions docs/ucx/docs/reference/workflows/index.mdx
@@ -251,6 +251,11 @@ The output is processed and displayed in the migration dashboard using the in `r
- run the [`create-table-mapping` command](/docs/reference/commands#create-table-mapping)
- or manually create a `mapping.csv` file in Workspace -> Applications -> ucx

## [EXPERIMENTAL] Workspace Code Scanner Workflow

The [`workspace-code-scanner-experimental`](/docs/reference/workspace-table-scanning) workflow scans all notebooks and files in the workspace for the tables they use. It performs a static analysis of the code to identify the tables and views referenced, which helps identify the schemas in use so that the assessment can be focused on those schemas. The results are stored in the `used_tables_in_workspace` table in the inventory database.


## [EXPERIMENTAL] Migration Progress Workflow

The `migration-progress-experimental` workflow populates the tables visualized in the
161 changes: 161 additions & 0 deletions docs/ucx/docs/reference/workspace-table-scanning.md
@@ -0,0 +1,161 @@
# Workspace Table Scanning

UCX now supports comprehensive table usage detection across your entire Databricks workspace, beyond just workflows and dashboards. This expanded capability allows you to discover all table references in notebooks and files within specified workspace paths.

## Overview

Previously, UCX detected table usage only in workflows and dashboards. The workspace scanning feature expands this coverage to:
- **Workspace**: tables used in any notebook or file within the specified workspace paths

**Key Benefits:**
- **Discovery-first approach**: Runs as standalone workflow before assessment
- **Scope optimization**: Can limit Hive Metastore scanning to only databases that are referenced
- **Complete coverage**: Finds table usage beyond just workflows and dashboards
- **Independent execution**: Run on-demand without full assessment cycle

## How It Works

The workspace table scanner:

1. **Discovers Objects**: Recursively scans workspace paths to find all notebooks and supported files
2. **Analyzes Content**: Uses UCX's linting framework to extract table usage from each object
3. **Tracks Lineage**: Maintains detailed source lineage information for each table reference
4. **Stores Results**: Saves findings to the `used_tables_in_workspace` inventory table
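The pipeline above can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions: the in-memory `FAKE_WORKSPACE` dict stands in for the Workspace API, and the regex-based `extract_tables` stands in for UCX's real linting framework, which does full static analysis rather than pattern matching:

```python
import re
from dataclasses import dataclass


@dataclass
class TableReference:
    table_name: str
    source_id: str  # path of the notebook/file the reference was found in


# Stand-in for the Workspace API: path -> source text
FAKE_WORKSPACE = {
    "/Users/alice/etl.py": "df = spark.read.table('sales.orders')",
    "/Shared/report.sql": "SELECT * FROM sales.customers",
    "/Shared/readme.md": "not scanned",  # unsupported extension
}


def discover_objects(paths):
    """Step 1: find all supported notebooks/files under the given paths."""
    for source_id in FAKE_WORKSPACE:
        if any(source_id.startswith(p) for p in paths) and source_id.endswith((".py", ".sql")):
            yield source_id


def extract_tables(source_id):
    """Step 2: crude regex stand-in for UCX's linting framework."""
    code = FAKE_WORKSPACE[source_id]
    pattern = r"(?:FROM\s+|read\.table\(')([\w.]+)"
    return [TableReference(m, source_id) for m in re.findall(pattern, code, re.IGNORECASE)]


def scan(paths):
    """Steps 3-4: collect results (UCX persists these to the inventory table)."""
    results = []
    for source_id in discover_objects(paths):
        results.extend(extract_tables(source_id))
    return results


print([r.table_name for r in scan(["/"])])  # → ['sales.orders', 'sales.customers']
```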

## Supported File Types

The scanner supports:
- **Notebooks**: Python, SQL
- **Files**: Python (.py), SQL (.sql)

## Configuration

### Via Standalone Workflow

UCX now includes a dedicated `workspace-code-scanner-experimental` workflow that runs independently of the assessment:

**Workflow Parameters:**
- `paths`: comma-separated list of workspace paths to scan (required; the task logs an error if no paths are provided)

### Via CLI command
You can also run the scanner via the UCX CLI:

```bash
databricks labs ucx run-workspace-code-scanner --paths "/Users,/Shared"
```

### Programmatic Usage

```python
from databricks.labs.ucx.source_code.linters.workspace import WorkspaceCodeLinter
from databricks.labs.ucx.source_code.used_table import UsedTablesCrawler

# Initialize components
workspace_linter = WorkspaceCodeLinter(
    ws=workspace_client,
    sql_backend=sql_backend,
    inventory_database="ucx_inventory",
    path_lookup=path_lookup,
    used_tables_crawler=UsedTablesCrawler.for_workspace(sql_backend, "ucx_inventory"),
)

# Scan specific paths
workspace_paths = ["/Users/data_team", "/Shared/analytics"]
workspace_linter.scan_workspace_for_tables(workspace_paths)
```

## Typical Workflow Sequence

For optimal UCX assessment with scope optimization:

```bash
# 1. Run the workspace code scanner first (standalone)
databricks labs ucx run-workspace-code-scanner --paths "/"

# 2. Use the results to configure a scope-limited assessment
#    (the scanner workflow logs a suggested include_databases configuration)

# 3. Update your UCX config with the discovered databases
#    include_databases: ["database1", "database2", "database3"]

# 4. Run the assessment with the optimized scope
databricks labs ucx ensure-assessment-run
```

**Scope Optimization Example:**
```sql
-- Query to get databases for config
SELECT DISTINCT schema_name
FROM ucx_inventory.used_tables_in_workspace
WHERE catalog_name = 'hive_metastore'
ORDER BY schema_name;
```
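Once you have the schema names from the query above, a tiny helper can render the `include_databases` line for your config file. This is illustrative only — `render_include_databases` is not part of UCX, and you should adapt the key name to your own `config.yml`:

```python
def render_include_databases(schemas):
    """Render a sorted, de-duplicated include_databases YAML line."""
    unique = sorted(set(schemas))
    quoted = ", ".join(f'"{s}"' for s in unique)
    return f"include_databases: [{quoted}]"


# Duplicates are dropped and names are sorted for a stable config diff
print(render_include_databases(["sales", "finance", "sales"]))
# → include_databases: ["finance", "sales"]
```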

## Results and Analysis

### Inventory Table

Results are stored in `{inventory_database}.used_tables_in_workspace` with the following schema:

| Column | Type | Description |
|--------|------|-------------|
| `catalog_name` | string | Catalog containing the table |
| `schema_name` | string | Schema containing the table |
| `table_name` | string | Name of the table |
| `source_id` | string | Path to the workspace object |
| `source_lineage` | array | Detailed lineage information |
| `is_write` | boolean | Whether this is a write operation |
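The row layout can be modeled roughly as the following dataclass. This is a sketch of the stored shape, not UCX's actual class, and `source_lineage` is simplified here to a list of strings:

```python
from dataclasses import dataclass, field


@dataclass
class UsedTableRow:
    catalog_name: str
    schema_name: str
    table_name: str
    source_id: str  # path to the workspace object
    source_lineage: list[str] = field(default_factory=list)
    is_write: bool = False

    @property
    def full_name(self) -> str:
        """Fully-qualified three-part table name."""
        return f"{self.catalog_name}.{self.schema_name}.{self.table_name}"


row = UsedTableRow("hive_metastore", "sales", "orders", "/Users/alice/etl.py")
print(row.full_name)  # → hive_metastore.sales.orders
```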

### Example Queries

**Most used tables across workspace:**
```sql
SELECT
catalog_name,
schema_name,
table_name,
COUNT(*) as usage_count
FROM ucx_inventory.used_tables_in_workspace
GROUP BY catalog_name, schema_name, table_name
ORDER BY usage_count DESC;
```

**Table usage by workspace area:**
```sql
SELECT
CASE
WHEN source_id LIKE '%/Users/%' THEN 'User Notebooks'
WHEN source_id LIKE '%/Shared/%' THEN 'Shared Notebooks'
WHEN source_id LIKE '%/Repos/%' THEN 'Repository Code'
ELSE 'Other'
END as workspace_area,
COUNT(DISTINCT CONCAT(catalog_name, '.', schema_name, '.', table_name)) as unique_tables,
COUNT(*) as total_references
FROM ucx_inventory.used_tables_in_workspace
GROUP BY workspace_area;
```
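The same area classification can be reproduced client-side when post-processing query results in Python; the buckets mirror the `CASE` expression above (`workspace_area` is an illustrative helper, not part of UCX):

```python
def workspace_area(source_id: str) -> str:
    """Classify a workspace path into the same buckets as the SQL CASE above."""
    if "/Users/" in source_id:
        return "User Notebooks"
    if "/Shared/" in source_id:
        return "Shared Notebooks"
    if "/Repos/" in source_id:
        return "Repository Code"
    return "Other"


print(workspace_area("/Repos/team/pipeline.py"))  # → Repository Code
```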

**Files with the most table dependencies:**
```sql
SELECT
source_id,
COUNT(DISTINCT CONCAT(catalog_name, '.', schema_name, '.', table_name)) as table_count
FROM ucx_inventory.used_tables_in_workspace
GROUP BY source_id
ORDER BY table_count DESC
LIMIT 20;
```

## Best Practices

### Path Selection
- Start with critical paths like `/Shared/production` or specific team directories
- Avoid scanning entire workspace initially to gauge performance impact
- Exclude test/scratch directories to focus on production code
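These guidelines can be applied mechanically before kicking off a scan. A sketch, with an example exclusion list you would tune to your own workspace conventions:

```python
# Example exclusions — adjust to your workspace's test/scratch conventions
EXCLUDED_PREFIXES = ("/Users/scratch", "/tmp", "/Shared/experiments")


def select_scan_paths(candidates):
    """Drop test/scratch directories so the scan focuses on production code."""
    return [p for p in candidates if not p.startswith(EXCLUDED_PREFIXES)]


print(select_scan_paths(["/Shared/production", "/tmp/demo", "/Users/scratch/x"]))
# → ['/Shared/production']
```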

### Regular Scanning
- Run workspace scans weekly or monthly to track evolving dependencies
- Compare results over time to identify new table dependencies

### Result Analysis
- Combine workspace results with workflow and dashboard results for complete picture
- Use the lineage information to understand code relationships
10 changes: 10 additions & 0 deletions labs.yml
@@ -98,6 +98,16 @@ commands:
description: (Optional) Whether to run the assess-workflows for the collection of workspaces with ucx
installed. Default is False.

  - name: run-workspace-code-scanner
    description: (Experimental) Trigger the `workspace-code-scanner-experimental` job to scan workspace code and
      record the tables referenced in it.
    flags:
      - name: paths
        description: Comma-separated workspace paths to scan.
      - name: run-as-collection
        description: (Optional) Whether to run the workspace-code-scanner for the collection of workspaces with ucx
          installed. Default is False.


- name: update-migration-progress
description: trigger the `migration-progress-experimental` job to refresh the inventory that tracks the workspace
resources and their migration status.
22 changes: 22 additions & 0 deletions src/databricks/labs/ucx/assessment/workflows.py
@@ -247,3 +247,25 @@ def failing_task(self, ctx: RuntimeContext):
logger.warning("This is a test warning message.")
logger.error("This is a test error message.")
raise ValueError("This task is supposed to fail.")


class WorkspaceCodeScanner(Workflow):
    def __init__(self):
        super().__init__('workspace-code-scanner-experimental', [JobParameterDefinition(name="paths", default="")])

    @job_task
    def scan_workspace_code(self, ctx: RuntimeContext):
        """Scan workspace for table usage using the workspace code linter."""
        logger.info("Starting workspace table scanning")

        # The "paths" parameter is a comma-separated list of workspace paths
        path_param = ctx.named_parameters.get("paths", "")
        if not path_param:
            logger.error("No path parameter provided. Please provide a comma-separated list of paths to scan.")
            return
        paths = [p.strip() for p in path_param.split(",") if p.strip()]

        # Create and use the workspace linter
        workspace_linter = ctx.workspace_tables_linter
        workspace_linter.scan_workspace_for_tables(paths)
        logger.info("Workspace table scanning completed and results stored in inventory database")
19 changes: 19 additions & 0 deletions src/databricks/labs/ucx/cli.py
@@ -257,6 +257,25 @@ def run_assess_workflows(
deployed_workflows.run_workflow("assess-workflows", skip_job_wait=run_as_collection)


@ucx.command
def run_workspace_code_scanner_experimental(
    w: WorkspaceClient, run_as_collection: bool = False, a: AccountClient | None = None, paths: str | None = None
):
    """Manually trigger the workspace-code-scanner-experimental job."""
    if paths is None:
        logger.error("--paths is a required parameter.")
        return

    workspace_contexts = _get_workspace_contexts(w, a, run_as_collection)
    for ctx in workspace_contexts:
        workspace_id = ctx.workspace_client.get_workspace_id()
        deployed_workflows = ctx.deployed_workflows
        logger.info(f"Starting 'workspace-code-scanner-experimental' workflow in workspace: {workspace_id}")
        deployed_workflows.run_workflow(
            "workspace-code-scanner-experimental", named_parameters={"paths": paths}, skip_job_wait=run_as_collection
        )


@ucx.command
def update_migration_progress(
w: WorkspaceClient,
15 changes: 15 additions & 0 deletions src/databricks/labs/ucx/contexts/application.py
@@ -66,6 +66,7 @@
from databricks.labs.ucx.progress.install import VerifyProgressTracking
from databricks.labs.ucx.source_code.graph import DependencyResolver
from databricks.labs.ucx.source_code.linters.jobs import WorkflowLinter
from databricks.labs.ucx.source_code.linters.workspace import WorkspaceCodeLinter
from databricks.labs.ucx.source_code.known import KnownList
from databricks.labs.ucx.source_code.folders import FolderLoader
from databricks.labs.ucx.source_code.files import FileLoader, ImportFileResolver
@@ -610,6 +611,16 @@ def query_linter(self) -> QueryLinter:
self.config.debug_listing_upper_limit,
)

@cached_property
def workspace_tables_linter(self) -> WorkspaceCodeLinter:
    return WorkspaceCodeLinter(
        self.workspace_client,
        self.sql_backend,
        self.inventory_database,
        self.path_lookup,
        self.used_tables_crawler_for_workspace,
    )

@cached_property
def directfs_access_crawler_for_paths(self) -> DirectFsAccessCrawler:
return DirectFsAccessCrawler.for_paths(self.sql_backend, self.inventory_database)
@@ -626,6 +637,10 @@ def used_tables_crawler_for_paths(self):
def used_tables_crawler_for_queries(self):
return UsedTablesCrawler.for_queries(self.sql_backend, self.inventory_database)

@cached_property
def used_tables_crawler_for_workspace(self):
    return UsedTablesCrawler.for_workspace(self.sql_backend, self.inventory_database)

@cached_property
def redash(self) -> Redash:
return Redash(
3 changes: 2 additions & 1 deletion src/databricks/labs/ucx/runtime.py
@@ -6,7 +6,7 @@
from databricks.sdk.config import with_user_agent_extra

from databricks.labs.ucx.__about__ import __version__
from databricks.labs.ucx.assessment.workflows import Assessment, Failing, AssessWorkflows
from databricks.labs.ucx.assessment.workflows import Assessment, Failing, AssessWorkflows, WorkspaceCodeScanner
from databricks.labs.ucx.contexts.workflow_task import RuntimeContext
from databricks.labs.ucx.framework.tasks import Workflow, parse_args
from databricks.labs.ucx.installer.logs import TaskLogger
@@ -52,6 +52,7 @@ def all(cls):
ConvertWASBSToADLSGen2(),
PermissionsMigrationAPI(),
MigrationRecon(),
WorkspaceCodeScanner(),
Failing(),
]
)