Conversation

@dreadatour (Contributor) commented Nov 6, 2025

Alternative to #1438
See also SaaS PRs.

Summary by Sourcery

Fix distributed tests by centralizing job creation and environment setup, refactoring Celery worker orchestration, skipping incompatible tests in distributed mode, standardizing test fixtures, and updating the CI workflow for unified FFmpeg installation and prebuilt datachain_server venv initialization.

Enhancements:

  • Centralize job creation in tests with a new _create_job helper and propagate DATACHAIN_JOB_ID across fixtures
  • Refactor run_datachain_worker to spawn multiple Celery workers with dynamic queues and set DATACHAIN_STEP_ID and UDF_RUNNER_QUEUE_NAME_LIST environment variables
  • Update cloud_server_credentials fixture to session scope and use os.environ directly for AWS credentials
  • Standardize SQLiteMetastore and SQLiteWarehouse fixtures to accept string paths

CI:

  • Consolidate FFmpeg installation into a single apt-based step and add datachain venv initialization step to pre-build wheels and install dependencies

Tests:

  • Clear DATACHAIN_JOB_ID in job management tests via monkeypatch for isolation
  • Skip checkpoint tests when DATACHAIN_DISTRIBUTED is set
  • Simplify UDF tests to always expect RuntimeError in both local and distributed modes (see the sketch below)
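
For illustration, a minimal sketch of what the unified expectation can look like; the test and fixture names are hypothetical, not the PR's exact code:

import pytest

def test_udf_failure(run_failing_udf):  # run_failing_udf is a hypothetical fixture
    # Local and distributed runs are now expected to surface the same error.
    with pytest.raises(RuntimeError, match="UDF Execution Failed!"):
        run_failing_udf()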

sourcery-ai bot (Contributor) commented Nov 6, 2025

Reviewer's Guide

This PR overhauls the distributed test infrastructure by centralizing job creation, refining environment variable management, improving the distributed worker fixture, adjusting test expectations, skipping incompatible tests in distributed mode, and enhancing the CI workflow with unified dependencies and a venv caching step.

File-Level Changes

Introduce a shared job creation helper and refactor fixtures to set up DATACHAIN_JOB_ID consistently (tests/conftest.py; see the sketch after this list):
  • Add the _create_job helper function
  • Refactor the metastore and metastore_tmpfile fixtures to invoke _create_job and set the env var
  • Update the datachain_job_id fixture to yield from _create_job
  • Ensure the SQLite db_file uses str(tmp_path)

Standardize environment variable manipulation across tests (tests/conftest.py, tests/unit/test_job_management.py, tests/unit/test_catalog_loader.py):
  • Replace monkeypatch.delenv/setenv calls with os.environ operations in cloud_server_credentials
  • Add monkeypatch.delenv to clear DATACHAIN_JOB_ID in job management tests
  • Unset DATACHAIN_DISTRIBUTED_DISABLED in test_get_distributed_class

Enhance the run_datachain_worker fixture for distributed execution (tests/conftest.py):
  • Remove the explicit job_id parameter and skip when DATACHAIN_DISTRIBUTED is not set
  • Generate unique UDF queues per worker and set the related env vars
  • Propagate os.environ to subprocess calls
  • Set DATACHAIN_STEP_ID and UDF_RUNNER_QUEUE_NAME_LIST before yielding

Simplify exception expectations in UDF functional tests (tests/func/test_udf.py):
  • Always expect RuntimeError with the message "UDF Execution Failed!" in test_udf scenarios

Skip checkpoint tests when running in distributed mode (tests/unit/lib/test_checkpoints.py, tests/func/test_checkpoints.py):
  • Annotate unit and functional checkpoint tests with pytest.mark.skipif for DATACHAIN_DISTRIBUTED

Streamline the CI workflow with unified dependency installs and venv caching (.github/workflows/tests-studio.yml):
  • Consolidate FFmpeg installation into a single step
  • Add datachain venv initialization with pip wheel caching
  • Install worker requirements via uv using cached wheels
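
To make the first change above concrete, here is a minimal sketch of the shared helper pattern, assuming the metastore fixture exposes a create_job method; the exact signature, fields, and cleanup behavior in the PR may differ:

import os
import uuid

import pytest

def _create_job(metastore):
    # Create a job record and expose its ID to the code under test.
    # create_job and its arguments are assumptions for illustration.
    job_id = metastore.create_job(name=f"test-job-{uuid.uuid4().hex[:8]}")
    os.environ["DATACHAIN_JOB_ID"] = str(job_id)
    try:
        yield job_id
    finally:
        os.environ.pop("DATACHAIN_JOB_ID", None)

@pytest.fixture
def datachain_job_id(metastore):
    # Delegate to the shared generator so every fixture that needs a job
    # sets up and tears down DATACHAIN_JOB_ID the same way.
    yield from _create_job(metastore)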


@dreadatour mentioned this pull request Nov 6, 2025
sourcery-ai bot left a comment

Hey there - I've reviewed your changes - here's some feedback:

  • There’s a lot of repeated monkeypatch.delenv("DATACHAIN_JOB_ID") in the unit tests; consider introducing a dedicated fixture to clear that env var automatically and reduce boilerplate (see the sketch after this list).
  • The cloud_server_credentials fixture mixes direct os.environ mutations with monkeypatch calls; switching entirely to monkeypatch.setenv/delenv will keep test isolation more consistent.
  • The CI workflow’s virtualenv and wheel-building steps are quite verbose—extracting them into a reusable GitHub Action or caching the built venv could simplify the pipeline and speed up builds.
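
A minimal sketch of the dedicated fixture suggested in the first point; the fixture name is hypothetical, and autouse could be scoped to just the affected test modules:

import pytest

@pytest.fixture(autouse=True)
def clear_datachain_job_id(monkeypatch):
    # Ensure each test starts without an inherited job ID.
    monkeypatch.delenv("DATACHAIN_JOB_ID", raising=False)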

Copilot AI left a comment

Pull Request Overview

This PR updates test infrastructure to handle distributed mode execution and improves test isolation by ensuring clean environment state. The key changes focus on preventing test interference from environment variables and skipping checkpoint tests in distributed mode.

Key changes:

  • Added monkeypatch.delenv("DATACHAIN_JOB_ID", raising=False) to multiple test functions to ensure clean job ID state
  • Added @pytest.mark.skipif decorators to checkpoint tests to skip them in distributed mode
  • Modified fixtures to properly set up job IDs and create unique worker queues for distributed testing
  • Simplified error handling in UDF tests to expect consistent RuntimeError behavior

Reviewed Changes

Copilot reviewed 7 out of 7 changed files in this pull request and generated 8 comments.

Summary per file:

tests/unit/test_job_management.py: Added monkeypatch parameter and job ID cleanup to ensure test isolation
tests/unit/test_catalog_loader.py: Added environment cleanup for distributed mode testing
tests/unit/lib/test_checkpoints.py: Added skipif decorators to skip checkpoint tests in distributed mode
tests/func/test_checkpoints.py: Added a skipif decorator to skip the checkpoint test in distributed mode
tests/func/test_udf.py: Simplified error handling to expect consistent RuntimeError behavior
tests/conftest.py: Refactored fixtures to set up job IDs properly and create unique worker queues; changed the cloud_server_credentials scope to session
.github/workflows/tests-studio.yml: Consolidated FFmpeg installation and added datachain venv initialization

Comment on lines +24 to +27
@pytest.mark.skipif(
"os.environ.get('DATACHAIN_DISTRIBUTED')",
reason="Checkpoints test skipped in distributed mode",
)
Copilot AI Nov 6, 2025

The @pytest.mark.skipif condition uses a string expression but os is not imported in this file. This will cause a NameError at test collection time. Import os at the top of the file or use 'DATACHAIN_DISTRIBUTED' in os.environ with a proper import.
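
For reference, pytest documents that string skipif conditions are evaluated in a namespace that already provides os, sys, and platform, so the string form may work without an explicit import; a boolean condition avoids the question entirely. A sketch with a hypothetical test name:

import os

import pytest

@pytest.mark.skipif(
    os.environ.get("DATACHAIN_DISTRIBUTED") is not None,
    reason="Checkpoints test skipped in distributed mode",
)
def test_checkpoint_roundtrip():  # hypothetical test name
    ...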
Comment on lines +91 to +94
@pytest.mark.skipif(
"os.environ.get('DATACHAIN_DISTRIBUTED')",
reason="Checkpoints test skipped in distributed mode",
)
Copilot AI Nov 6, 2025

Same issue as above: the skipif string condition references os, which is not imported in this file.
Comment on lines +126 to +129
@pytest.mark.skipif(
"os.environ.get('DATACHAIN_DISTRIBUTED')",
reason="Checkpoints test skipped in distributed mode",
)
Copilot AI Nov 6, 2025

Same issue as above: the skipif string condition references os, which is not imported in this file.
Comment on lines +195 to +198
@pytest.mark.skipif(
"os.environ.get('DATACHAIN_DISTRIBUTED')",
reason="Checkpoints test skipped in distributed mode",
)
Copilot AI Nov 6, 2025

Same issue as above: the skipif string condition references os, which is not imported in this file.
Comment on lines +14 to +17
@pytest.mark.skipif(
"os.environ.get('DATACHAIN_DISTRIBUTED')",
reason="Checkpoints test skipped in distributed mode",
)
Copilot AI Nov 6, 2025

Same issue as above: the skipif string condition references os, which is not imported in this file.
_metastore.cleanup_for_tests()
else:
_metastore = SQLiteMetastore(db_file=tmp_path / "test.db")
_metastore = SQLiteMetastore(db_file=str(tmp_path / "test.db"))
Copilot AI Nov 6, 2025

Converting Path to string is unnecessary here. The SQLiteMetastore constructor likely accepts Path objects directly. If the change from tmp_path / "test.db" to str(tmp_path / "test.db") is intentional to fix an issue, it would be helpful to understand why, but this appears inconsistent with other code that uses paths directly.

Suggested change
_metastore = SQLiteMetastore(db_file=str(tmp_path / "test.db"))
_metastore = SQLiteMetastore(db_file=tmp_path / "test.db")

_warehouse.cleanup_for_tests()
else:
_warehouse = SQLiteWarehouse(db_file=tmp_path / "test.db")
_warehouse = SQLiteWarehouse(db_file=str(tmp_path / "test.db"))
Copilot AI Nov 6, 2025

Same concern as above: converting the Path to a string appears unnecessary, since the SQLiteWarehouse constructor likely accepts Path objects directly.

Suggested change
_warehouse = SQLiteWarehouse(db_file=str(tmp_path / "test.db"))
_warehouse = SQLiteWarehouse(db_file=tmp_path / "test.db")

PYTHON_VERSION: ${{ matrix.pyv }}
DATACHAIN_VENV_DIR: /tmp/local/datachain_venv/python${{ matrix.pyv }}
run: |
virtualenv -p "$(which python"${PYTHON_VERSION}")" "${DATACHAIN_VENV_DIR}"
Copilot AI Nov 6, 2025

Missing space in the command substitution syntax. The correct syntax should be "$(which python"${PYTHON_VERSION}")" or more clearly "$(which python${PYTHON_VERSION})". The current syntax python"${PYTHON_VERSION}" will not properly interpolate the variable.

Suggested change
virtualenv -p "$(which python"${PYTHON_VERSION}")" "${DATACHAIN_VENV_DIR}"
virtualenv -p "$(which python${PYTHON_VERSION})" "${DATACHAIN_VENV_DIR}"

codecov bot commented Nov 6, 2025

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 87.96%. Comparing base (176b7cb) to head (8eec03f).
⚠️ Report is 1 commit behind head on main.

Additional details and impacted files


@@           Coverage Diff           @@
##             main    #1451   +/-   ##
=======================================
  Coverage   87.96%   87.96%           
=======================================
  Files         160      160           
  Lines       15377    15377           
  Branches     2224     2224           
=======================================
  Hits        13527    13527           
  Misses       1336     1336           
  Partials      514      514           
Flag        Coverage Δ
datachain   87.92% <ø> (ø)

Flags with carried forward coverage won't be shown.


@dreadatour merged commit a0bea01 into main on Nov 7, 2025
76 of 85 checks passed
@dreadatour deleted the fix-distributed-tests branch on November 7, 2025 at 06:23