-
Notifications
You must be signed in to change notification settings - Fork 26
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Refactor pipeline interface #901
Conversation
FYI, the docs pipeline is failing because we reference the |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @mrchtr, great work!
I haven't reviewed the tests yet (the PR is quite big 😅), but already left my comments on the code.
src/fondant/cli.py
Outdated
f"Found multiple instantiated datasets in {module_str}. Use the first dataset to start " | ||
f"the execution." |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
f"Found multiple instantiated datasets in {module_str}. Use the first dataset to start " | |
f"the execution." | |
f"Found multiple instantiated datasets in {module_str}. Using the first dataset to start " | |
f"the execution." |
Shouldn't we use the last dataset? Or should it not matter? And if it doesn't matter, should we even log this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've checked it again. We don't need it since we don't create a deepcopy of the graph. Every dataset will contain the same computational graph after applying a new operations. We are not able to create a deepcopy since the graph contains some mappingproxy objects. Shouldn't be a problem for now, if we just allow a single dataset object in a module.
For the case we allow something like this:
dataset_1 = Dataset.read(...)
dataset_2 = dataset_1.apply(...)
dataset_3 = dataset_2.apply(...)
we will run into issues. For now we could add a validation that every dataset of a module contains the same graph, wdyt?
First steps for the implementation of the new dataset interface: - Removed Pipeline class - Added Workspace singleton to hold pipeline name, base_path, etc. .. (shouldn't be the focus of this PR) - Moved `Pipeline.read(..)` to Dataset class
Implements the new workflow including initialising datasets from manifest. - removes workspace again, using working_directory argument now to enable something like `fondant run ... --working_directory gs://...` - removes base_path from the manifest and changes the field location to absolute paths - changes the executor to generate absolute field locations - add `Dataset.read()` for initialising dataset from manifest path - adjust compiler code Try to create sketch out the flow and changes. Still have to fix the test cases and test the changes end to end for all runners. ![image](https://github.com/ml6team/fondant/assets/15777729/8c330daf-2076-493c-aea2-b7d3f85884b8)
4ac7a0b
to
59f6de3
Compare
Co-authored-by: Matthias Richter <matthias.r1092@gmail.com>
59f6de3
to
a6561be
Compare
Feature branch for refactoring the pipeline interface.
Going to create separate PRs to maintain an overview of all changes and avoid breaking the main branch.