Mlflow implementation of Tracking Interface #768

Merged on Feb 5, 2025 (58 commits)

@njbrake njbrake commented Jan 29, 2025

What's changing

MLflow implementation of the tracking interface, used to store and retrieve job results and workflow results.

This means we no longer use the ExperimentRepository for tracking experiments.
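As a rough illustration of what a tracking interface with a swappable backend can look like, here is a minimal sketch. Everything in it is an assumption made for illustration: the `TrackingClient` protocol, its method names, and the `InMemoryTrackingClient` stand-in are hypothetical and are not taken from this PR. An MLflow-backed class would implement the same methods against a tracking server.

```python
import uuid
from dataclasses import dataclass, field
from typing import Dict, Protocol


class TrackingClient(Protocol):
    """Hypothetical tracking interface; the method names are illustrative only."""

    def create_experiment(self, name: str) -> str: ...

    def log_job_result(
        self, experiment_id: str, job_id: str, metrics: Dict[str, float]
    ) -> None: ...

    def get_job_results(self, experiment_id: str) -> Dict[str, Dict[str, float]]: ...


@dataclass
class InMemoryTrackingClient:
    """Stand-in backend so the sketch runs without a tracking server."""

    _names: Dict[str, str] = field(default_factory=dict)
    _results: Dict[str, Dict[str, Dict[str, float]]] = field(default_factory=dict)

    def create_experiment(self, name: str) -> str:
        # IDs are plain strings here, not UUID objects.
        experiment_id = uuid.uuid4().hex
        self._names[experiment_id] = name
        self._results[experiment_id] = {}
        return experiment_id

    def log_job_result(
        self, experiment_id: str, job_id: str, metrics: Dict[str, float]
    ) -> None:
        # Store a copy so later changes by the caller do not alter the record.
        self._results[experiment_id][job_id] = dict(metrics)

    def get_job_results(self, experiment_id: str) -> Dict[str, Dict[str, float]]:
        return dict(self._results[experiment_id])


def record_demo(client: TrackingClient) -> Dict[str, Dict[str, float]]:
    """Service code depends only on the protocol, not on a concrete backend."""
    experiment_id = client.create_experiment("demo")
    client.log_job_result(experiment_id, "job-1", {"score": 0.5})
    return client.get_job_results(experiment_id)
```

Because callers depend only on the protocol, the backend can be replaced without touching service code.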

This PR also moves workflow management code out of the Job service and converts the _on_job_complete function into a "wait_for_complete" function. This way we don't need to trigger nested background tasks: a workflow runs as a single background task, encapsulated in a single function in workflows.py, which makes it easier to accommodate an expanded range of workflow tasks (future work).
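The idea of awaiting each job inside one workflow task, rather than chaining completion callbacks, can be sketched roughly as follows. This is not the code from the PR: `FakeJobService`, `JobStatus`, `run_workflow`, and the polling parameters are hypothetical, and only the name `wait_for_complete` comes from the description above.

```python
import asyncio
from enum import Enum
from typing import Dict, List


class JobStatus(str, Enum):
    RUNNING = "running"
    SUCCEEDED = "succeeded"
    FAILED = "failed"


class FakeJobService:
    """Hypothetical job service: each job succeeds after a fixed number of polls."""

    def __init__(self, polls_until_done: int = 2) -> None:
        self._polls_until_done = polls_until_done
        self._polls: Dict[str, int] = {}

    def submit(self, name: str) -> str:
        job_id = f"{name}-job"
        self._polls[job_id] = 0
        return job_id

    def status(self, job_id: str) -> JobStatus:
        self._polls[job_id] += 1
        if self._polls[job_id] >= self._polls_until_done:
            return JobStatus.SUCCEEDED
        return JobStatus.RUNNING


async def wait_for_complete(
    jobs: FakeJobService, job_id: str, poll_interval: float = 0.01, timeout: float = 1.0
) -> JobStatus:
    """Poll until the job leaves the RUNNING state or the timeout expires."""
    loop = asyncio.get_running_loop()
    deadline = loop.time() + timeout
    while True:
        status = jobs.status(job_id)
        if status is not JobStatus.RUNNING:
            return status
        if loop.time() >= deadline:
            raise TimeoutError(f"job {job_id} did not finish within {timeout}s")
        await asyncio.sleep(poll_interval)


async def run_workflow(jobs: FakeJobService) -> List[str]:
    """Run inference, then evaluation, sequentially inside one task."""
    completed: List[str] = []
    for step in ("inference", "evaluation"):
        job_id = jobs.submit(step)
        status = await wait_for_complete(jobs, job_id)
        if status is not JobStatus.SUCCEEDED:
            raise RuntimeError(f"{step} job {job_id} ended with status {status.value}")
        completed.append(job_id)
    return completed
```

Scheduling `run_workflow` once as a background task keeps the whole sequence in one place, so adding a step means adding an entry to the loop instead of registering another callback.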

This pull request includes several changes to the backend of the Lumigator project, focusing on enhancing the functionality of the experiment and workflow services, improving dependency management, and updating the Makefile for testing configurations. The most significant changes include the addition of tracking client dependencies, updates to experiment and workflow routes, and improvements to job creation processes.

Enhancements to Experiment and Workflow Services:

  • lumigator/backend/backend/api/deps.py: Added the tracking_client_manager dependency and updated get_experiment_service and get_workflow_service to include tracking_client as a parameter.
  • Implemented the workflow service so that it can run the inference + evaluation workflow, which is the only type of workflow supported at the moment.

Updates to Experiment and Workflow Routes:

  • lumigator/backend/backend/api/router.py: Included the workflows route in the OpenAPI schema.
  • lumigator/backend/backend/api/routes/experiments.py: Updated the response types for create_experiment_id and get_experiment_new routes, and added a new route to delete experiments.
  • lumigator/backend/backend/api/routes/workflows.py: Added new routes for getting workflow logs and deleting workflows, and updated the request type for creating workflows.

Improvements to Job Creation Processes:

  • lumigator/backend/backend/api/routes/jobs.py: Removed background_tasks parameter from job creation methods and added asynchronous background tasks for handling job completion and dataset updates.

Exception Handling Enhancements:

  • lumigator/backend/backend/main.py: Added workflow exception mappings to the FastAPI application.
  • lumigator/backend/backend/services/exceptions/experiment_exceptions.py: Updated ExperimentNotFoundError to use str instead of UUID for resource ID.
  • lumigator/backend/backend/services/exceptions/workflow_exceptions.py: Added WorkflowNotFoundError class for handling workflow-related exceptions.
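A minimal sketch of how a not-found exception keyed by a string ID can be mapped to an HTTP status is shown below. The base class, the message format, the `WORKFLOW_EXCEPTION_MAP` dictionary, and the `status_code_for` helper are all hypothetical; only the names WorkflowNotFoundError and ExperimentNotFoundError, and the use of str for the resource ID, come from the list above.

```python
from typing import Dict, Type


class NotFoundError(Exception):
    """Hypothetical base class for missing-resource errors."""

    def __init__(self, resource: str, resource_id: str) -> None:
        # The ID is a plain string, so identifiers that are not UUIDs also fit.
        self.resource = resource
        self.resource_id = resource_id
        super().__init__(f"{resource} with ID {resource_id} not found")


class ExperimentNotFoundError(NotFoundError):
    def __init__(self, resource_id: str) -> None:
        super().__init__("Experiment", resource_id)


class WorkflowNotFoundError(NotFoundError):
    def __init__(self, resource_id: str) -> None:
        super().__init__("Workflow", resource_id)


# Hypothetical mapping from exception type to HTTP status code.
WORKFLOW_EXCEPTION_MAP: Dict[Type[Exception], int] = {
    WorkflowNotFoundError: 404,
}


def status_code_for(exc: Exception, mapping: Dict[Type[Exception], int]) -> int:
    """Return the mapped status code, falling back to 500 for unmapped errors."""
    for exc_type, code in mapping.items():
        if isinstance(exc, exc_type):
            return code
    return 500
```

In a web application, a mapping like this would typically be registered as exception handlers at startup so that route code can raise domain errors without knowing about HTTP.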

Testing Configuration Updates:

  • Makefile: Added MLFLOW_TRACKING_URI environment variable to test-backend-unit and test-backend-integration targets.
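For context on why the test targets need this variable, here is a small sketch of a fail-fast check that test setup code could run. The `resolve_tracking_uri` helper is hypothetical and not part of this PR; MLflow itself reads MLFLOW_TRACKING_URI from the environment when no tracking URI is set explicitly.

```python
import os
from typing import Mapping


def resolve_tracking_uri(env: Mapping[str, str] = os.environ) -> str:
    """Return the configured tracking URI, or fail with a clear message."""
    uri = env.get("MLFLOW_TRACKING_URI", "").strip()
    if not uri:
        raise RuntimeError(
            "MLFLOW_TRACKING_URI is not set; export it before running the tests"
        )
    return uri
```

Passing the environment as a parameter keeps the helper easy to exercise with a plain dictionary.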

Deployment Updates:

  • Added mlflow to the main docker compose deployment

Refs #527

How to test it

This PR does not yet switch the frontend or SDK over to using workflows, so it can be tested either via curl commands or via the integration test. The test exercises the integration and makes sure that experiments, workflows, and jobs can be created and deleted successfully, and that the logs can be accessed.

Steps to test the changes:

  1. make local-up
  2. make test-backend-integration

Additional notes for reviewers

I already...

  • Tested the changes in a working environment to ensure they work as expected
  • Added some tests for any new functionality
  • Updated the documentation (both comments in code and product documentation under /docs)
  • Checked if a (backend) DB migration step was required and included it if required

njbrake and others added 30 commits January 23, 2025 14:46
Signed-off-by: Nathan Brake <33383515+njbrake@users.noreply.github.com>

@javiermtorres javiermtorres left a comment

I'd insist on adding the list of JobCreate as workflow params. This will avoid making adaptations to the job and workflow params that would need to be removed later on. We can assume that only the [infer, eval] list will be sent, for the moment.

@github-actions github-actions bot added documentation Improvements or additions to documentation gha GitHub actions related labels Jan 31, 2025
Base automatically changed from 741-tracking-interface to main January 31, 2025 14:34
@github-actions github-actions bot added the dependencies Pull requests that update a dependency file label Jan 31, 2025
@njbrake njbrake linked an issue Jan 31, 2025 that may be closed by this pull request

njbrake commented Jan 31, 2025

I'd insist on adding the list of JobCreate as workflow params. This will avoid making adaptations to the job and workflow params that would need to be removed later on. We can assume that only the [infer, eval] list will be sent, for the moment.

👍
Documenting our Slack conversation here: this MLflow implementation PR will wait for #740, since that PR refactors the JobCreation schemas. Once it is merged, we'll update this branch to be compatible with those changes, at which point we can merge.

@javiermtorres javiermtorres left a comment

Approved with a couple of caveats.

  • Adding the jobs directory directly to the PYTHONPATH causes a few issues when the job "pseudo package" has a deeper structure and the backend starts importing job implementation packages.
  • The workflow will need to expose the list of job configs, but this can be done in "First attempt at a parametrized JobCreate" (#740).

@njbrake njbrake merged commit 81944e1 into main Feb 5, 2025
15 checks passed
@njbrake njbrake deleted the 527-mlflow-implementation branch February 5, 2025 20:39
Labels

  • api: Changes which impact API/presentation layer
  • backend
  • dependencies: Pull requests that update a dependency file
  • documentation: Improvements or additions to documentation
  • gha: GitHub actions related
  • schemas: Changes to schemas (which may be public facing)
  • sdk
Development

Successfully merging this pull request may close these issues.

Experiment implementation using a tracking server
4 participants