Conversation

@dreadatour (Contributor) commented Nov 6, 2025

Alternative to #1438
See also SaaS PRs.

Summary by Sourcery

Fix distributed tests by centralizing job creation and environment setup, refactoring Celery worker orchestration, skipping incompatible tests in distributed mode, standardizing test fixtures, and updating the CI workflow for unified FFmpeg installation and prebuilt datachain_server venv initialization.

Enhancements:

  • Centralize job creation in tests with a new _create_job helper and propagate DATACHAIN_JOB_ID across fixtures
  • Refactor run_datachain_worker to spawn multiple Celery workers with dynamic queues and set DATACHAIN_STEP_ID and UDF_RUNNER_QUEUE_NAME_LIST environment variables
  • Update cloud_server_credentials fixture to session scope and use os.environ directly for AWS credentials
  • Standardize SQLiteMetastore and SQLiteWarehouse fixtures to accept string paths

CI:

  • Consolidate FFmpeg installation into a single apt-based step and add datachain venv initialization step to pre-build wheels and install dependencies

Tests:

  • Clear DATACHAIN_JOB_ID in job management tests via monkeypatch for isolation
  • Skip checkpoint tests when DATACHAIN_DISTRIBUTED is set
  • Simplify UDF tests to always expect RuntimeError in both local and distributed modes (see the sketch below)
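
For illustration, a minimal sketch of what the unified expectation can look like; the test and fixture names are hypothetical, not the PR's exact code:

import pytest

def test_udf_failure(run_failing_udf):  # run_failing_udf is a hypothetical fixture
    # Local and distributed runs are now expected to surface the same error.
    with pytest.raises(RuntimeError, match="UDF Execution Failed!"):
        run_failing_udf()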

sourcery-ai bot (Contributor) commented Nov 6, 2025

Reviewer's Guide

This PR overhauls the distributed test infrastructure by centralizing job creation, refining environment variable management, improving the distributed worker fixture, adjusting test expectations, skipping incompatible tests in distributed mode, and enhancing the CI workflow with unified dependencies and a venv caching step.

File-Level Changes

Introduce a shared job creation helper and refactor fixtures to set up DATACHAIN_JOB_ID consistently (tests/conftest.py; see the sketch after this list):
  • Add the _create_job helper function
  • Refactor the metastore and metastore_tmpfile fixtures to invoke _create_job and set the env var
  • Update the datachain_job_id fixture to yield from _create_job
  • Ensure the SQLite db_file uses str(tmp_path)

Standardize environment variable manipulation across tests (tests/conftest.py, tests/unit/test_job_management.py, tests/unit/test_catalog_loader.py):
  • Replace monkeypatch.delenv/setenv calls with os.environ operations in cloud_server_credentials
  • Add monkeypatch.delenv to clear DATACHAIN_JOB_ID in job management tests
  • Unset DATACHAIN_DISTRIBUTED_DISABLED in test_get_distributed_class

Enhance the run_datachain_worker fixture for distributed execution (tests/conftest.py):
  • Remove the explicit job_id parameter and skip when DATACHAIN_DISTRIBUTED is not set
  • Generate unique UDF queues per worker and set the related env vars
  • Propagate os.environ to subprocess calls
  • Set DATACHAIN_STEP_ID and UDF_RUNNER_QUEUE_NAME_LIST before yielding

Simplify exception expectations in UDF functional tests (tests/func/test_udf.py):
  • Always expect RuntimeError with the message "UDF Execution Failed!" in test_udf scenarios

Skip checkpoint tests when running in distributed mode (tests/unit/lib/test_checkpoints.py, tests/func/test_checkpoints.py):
  • Annotate unit and functional checkpoint tests with pytest.mark.skipif for DATACHAIN_DISTRIBUTED

Streamline the CI workflow with unified dependency installs and venv caching (.github/workflows/tests-studio.yml):
  • Consolidate FFmpeg installation into a single step
  • Add datachain venv initialization with pip wheel caching
  • Install worker requirements via uv using cached wheels
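
To make the first change above concrete, here is a minimal sketch of the shared helper pattern, assuming the metastore fixture exposes a create_job method; the exact signature, fields, and cleanup behavior in the PR may differ:

import os
import uuid

import pytest

def _create_job(metastore):
    # Create a job record and expose its ID to the code under test.
    # create_job and its arguments are assumptions for illustration.
    job_id = metastore.create_job(name=f"test-job-{uuid.uuid4().hex[:8]}")
    os.environ["DATACHAIN_JOB_ID"] = str(job_id)
    try:
        yield job_id
    finally:
        os.environ.pop("DATACHAIN_JOB_ID", None)

@pytest.fixture
def datachain_job_id(metastore):
    # Delegate to the shared generator so every fixture that needs a job
    # sets up and tears down DATACHAIN_JOB_ID the same way.
    yield from _create_job(metastore)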


@dreadatour mentioned this pull request Nov 6, 2025
sourcery-ai bot left a comment

Hey there - I've reviewed your changes - here's some feedback:

  • There’s a lot of repeated monkeypatch.delenv("DATACHAIN_JOB_ID") in the unit tests; consider introducing a dedicated fixture to clear that env var automatically and reduce boilerplate (see the sketch after this list).
  • The cloud_server_credentials fixture mixes direct os.environ mutations with monkeypatch calls; switching entirely to monkeypatch.setenv/delenv will keep test isolation more consistent.
  • The CI workflow’s virtualenv and wheel-building steps are quite verbose—extracting them into a reusable GitHub Action or caching the built venv could simplify the pipeline and speed up builds.
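
A minimal sketch of the dedicated fixture suggested in the first point; the fixture name is hypothetical, and autouse could be scoped to just the affected test modules:

import pytest

@pytest.fixture(autouse=True)
def clear_datachain_job_id(monkeypatch):
    # Ensure each test starts without an inherited job ID.
    monkeypatch.delenv("DATACHAIN_JOB_ID", raising=False)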

Copilot AI left a comment

Pull Request Overview

This PR updates test infrastructure to handle distributed mode execution and improves test isolation by ensuring clean environment state. The key changes focus on preventing test interference from environment variables and skipping checkpoint tests in distributed mode.

Key changes:

  • Added monkeypatch.delenv("DATACHAIN_JOB_ID", raising=False) to multiple test functions to ensure clean job ID state
  • Added @pytest.mark.skipif decorators to checkpoint tests to skip them in distributed mode
  • Modified fixtures to properly set up job IDs and create unique worker queues for distributed testing
  • Simplified error handling in UDF tests to expect consistent RuntimeError behavior

Reviewed Changes

Copilot reviewed 7 out of 7 changed files in this pull request and generated 8 comments.

Summary per file:

tests/unit/test_job_management.py: Added monkeypatch parameter and job ID cleanup to ensure test isolation
tests/unit/test_catalog_loader.py: Added environment cleanup for distributed mode testing
tests/unit/lib/test_checkpoints.py: Added skipif decorators to skip checkpoint tests in distributed mode
tests/func/test_checkpoints.py: Added a skipif decorator to skip the checkpoint test in distributed mode
tests/func/test_udf.py: Simplified error handling to expect consistent RuntimeError behavior
tests/conftest.py: Refactored fixtures to set up job IDs properly and create unique worker queues; changed the cloud_server_credentials scope to session
.github/workflows/tests-studio.yml: Consolidated FFmpeg installation and added datachain venv initialization

Comment on lines +24 to +27
@pytest.mark.skipif(
"os.environ.get('DATACHAIN_DISTRIBUTED')",
reason="Checkpoints test skipped in distributed mode",
)
Copilot AI Nov 6, 2025

The @pytest.mark.skipif condition uses a string expression but os is not imported in this file. This will cause a NameError at test collection time. Import os at the top of the file or use 'DATACHAIN_DISTRIBUTED' in os.environ with a proper import.
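
For reference, pytest documents that string skipif conditions are evaluated in a namespace that already provides os, sys, and platform, so the string form may work without an explicit import; a boolean condition avoids the question entirely. A sketch with a hypothetical test name:

import os

import pytest

@pytest.mark.skipif(
    os.environ.get("DATACHAIN_DISTRIBUTED") is not None,
    reason="Checkpoints test skipped in distributed mode",
)
def test_checkpoint_roundtrip():  # hypothetical test name
    ...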
Comment on lines +91 to +94
@pytest.mark.skipif(
"os.environ.get('DATACHAIN_DISTRIBUTED')",
reason="Checkpoints test skipped in distributed mode",
)
Copilot AI Nov 6, 2025

Same issue as above: the skipif string condition references os, which is not imported in this file.
Comment on lines +126 to +129
@pytest.mark.skipif(
"os.environ.get('DATACHAIN_DISTRIBUTED')",
reason="Checkpoints test skipped in distributed mode",
)
Copilot AI Nov 6, 2025

Same issue as above: the skipif string condition references os, which is not imported in this file.
Comment on lines +195 to +198
@pytest.mark.skipif(
"os.environ.get('DATACHAIN_DISTRIBUTED')",
reason="Checkpoints test skipped in distributed mode",
)
Copilot AI Nov 6, 2025

Same issue as above: the skipif string condition references os, which is not imported in this file.
Comment on lines +14 to +17
@pytest.mark.skipif(
"os.environ.get('DATACHAIN_DISTRIBUTED')",
reason="Checkpoints test skipped in distributed mode",
)
Copilot AI Nov 6, 2025

Same issue as above: the skipif string condition references os, which is not imported in this file.
_metastore.cleanup_for_tests()
else:
_metastore = SQLiteMetastore(db_file=tmp_path / "test.db")
_metastore = SQLiteMetastore(db_file=str(tmp_path / "test.db"))
Copilot AI Nov 6, 2025

Converting Path to string is unnecessary here. The SQLiteMetastore constructor likely accepts Path objects directly. If the change from tmp_path / "test.db" to str(tmp_path / "test.db") is intentional to fix an issue, it would be helpful to understand why, but this appears inconsistent with other code that uses paths directly.

Suggested change
_metastore = SQLiteMetastore(db_file=str(tmp_path / "test.db"))
_metastore = SQLiteMetastore(db_file=tmp_path / "test.db")

_warehouse.cleanup_for_tests()
else:
_warehouse = SQLiteWarehouse(db_file=tmp_path / "test.db")
_warehouse = SQLiteWarehouse(db_file=str(tmp_path / "test.db"))
Copilot AI Nov 6, 2025

Same concern as above: converting the Path to a string appears unnecessary, since the SQLiteWarehouse constructor likely accepts Path objects directly.

Suggested change
_warehouse = SQLiteWarehouse(db_file=str(tmp_path / "test.db"))
_warehouse = SQLiteWarehouse(db_file=tmp_path / "test.db")

PYTHON_VERSION: ${{ matrix.pyv }}
DATACHAIN_VENV_DIR: /tmp/local/datachain_venv/python${{ matrix.pyv }}
run: |
virtualenv -p "$(which python"${PYTHON_VERSION}")" "${DATACHAIN_VENV_DIR}"
Copilot AI Nov 6, 2025

Missing space in the command substitution syntax. The correct syntax should be "$(which python"${PYTHON_VERSION}")" or more clearly "$(which python${PYTHON_VERSION})". The current syntax python"${PYTHON_VERSION}" will not properly interpolate the variable.

Suggested change
virtualenv -p "$(which python"${PYTHON_VERSION}")" "${DATACHAIN_VENV_DIR}"
virtualenv -p "$(which python${PYTHON_VERSION})" "${DATACHAIN_VENV_DIR}"

codecov bot commented Nov 6, 2025

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 87.96%. Comparing base (176b7cb) to head (8eec03f).
⚠️ Report is 1 commit behind head on main.

Additional details and impacted files


@@           Coverage Diff           @@
##             main    #1451   +/-   ##
=======================================
  Coverage   87.96%   87.96%           
=======================================
  Files         160      160           
  Lines       15377    15377           
  Branches     2224     2224           
=======================================
  Hits        13527    13527           
  Misses       1336     1336           
  Partials      514      514           
Flag        Coverage Δ
datachain   87.92% <ø> (ø)

Flags with carried forward coverage won't be shown.


@dreadatour merged commit a0bea01 into main on Nov 7, 2025
76 of 85 checks passed
@dreadatour deleted the fix-distributed-tests branch on November 7, 2025 at 06:23