
Implementation of new dataset interface #902

Merged

Conversation

@mrchtr (Contributor) commented Mar 11, 2024

First steps for the implementation of the new dataset interface:

  • Removed the Pipeline class
  • Added a Workspace singleton to hold the pipeline name, base_path, etc. (shouldn't be the focus of this PR)
  • Moved `Pipeline.read(..)` to the Dataset class

@mrchtr mrchtr requested review from RobbeSneyders and GeorgesLorre and removed request for RobbeSneyders March 11, 2024 15:21
@RobbeSneyders (Member) left a comment

Thanks @mrchtr!

Not sure if going with a singleton is the best approach here. It can make testing harder and can become a bit too magic. We might still want to inject it into the Dataset class instead.
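A minimal sketch of the injection approach being suggested, assuming hypothetical constructor signatures (the actual Workspace and Dataset fields in the PR may differ):

```python
# Hypothetical sketch: injecting a Workspace into Dataset instead of relying
# on a singleton. Field names (name, base_path) follow the PR description;
# the constructor shapes are assumptions.
from dataclasses import dataclass


@dataclass
class Workspace:
    name: str
    base_path: str


class Dataset:
    def __init__(self, name: str, *, workspace: Workspace):
        self.name = name
        # Explicit dependency: easy to replace with a throwaway
        # instance in tests, no hidden global state.
        self.workspace = workspace


# In a test, a fake workspace can simply be passed in:
ws = Workspace(name="test-pipeline", base_path="/tmp/fondant")
ds = Dataset("my-dataset", workspace=ws)
```

The trade-off against a singleton is verbosity: every call site carries the dependency, but tests can run in parallel with independent workspaces.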

src/fondant/cli.py (two review threads, outdated and resolved)
@GeorgesLorre (Collaborator) commented Mar 18, 2024

Starting to look good @mrchtr

I agree that the workspace concept might still need some further refinement, e.g. how we instantiate it and how we link it to an execution, but it's already serving us. We could refine it in a follow-up PR, or it might still be mergeable into the runners/compilers.

The Pipeline/Dataset changes are looking good, still some TODOs, but the MVP is here:

  • move the `.read()` method
  • add validation

I'm OK with merging this into the other branch and focusing on green pipelines. Then we can add the missing functionality.

@mrchtr (Contributor, Author) commented Mar 18, 2024


Sounds like a plan! Going to fix the tests and then we can divide the work from there on.

@mrchtr (Contributor, Author) left a comment

I've added some notes to discuss the open topics. Feel free to have a look @GeorgesLorre. :)

```python
            raise InvalidWorkspaceDefinition(msg)
        return name

    def get_run_id(self) -> str:
```
@mrchtr (Contributor, Author) commented:

Not sure if the caching is currently working. The run id is used to detect changes. The workspace should hold run ids in the future as well; I could imagine a combination of runner type and working directory.

@GeorgesLorre (Collaborator) commented:

I would not put more effort into it for this PR; we can test and fix separately.
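The run id idea floated above (runner type combined with working directory) could be sketched like this; the function name and hashing scheme are assumptions, not Fondant's actual implementation:

```python
# Hypothetical sketch: derive a deterministic run id from runner type and
# working directory, so repeated runs in the same environment get the same
# id and caching can detect unchanged executions.
import hashlib


def get_run_id(runner_type: str, working_directory: str) -> str:
    key = f"{runner_type}:{working_directory}".encode("utf-8")
    # Short, stable digest: same inputs always produce the same id.
    return hashlib.sha1(key).hexdigest()[:12]
```

Because the id is deterministic, a cache keyed on it survives process restarts; switching runner or working directory naturally invalidates it.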


```python
    def __init__(
        self,
        name: str,
        *,
        base_path: str,
```
@mrchtr (Contributor, Author) commented:

I would rename this to working_directory as part of #887 to make the difference between Dataset and Workspace clearer.
A Dataset holds all relevant information (e.g. name, description, manifest, potential metadata in the future).
A Workspace is used to determine the execution environment (e.g. runner information, working directories).
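The proposed split could look roughly like the following; all field names beyond those mentioned in the comment are assumptions:

```python
# Hypothetical sketch of the proposed responsibility split:
# Dataset holds data-related information, Workspace holds
# execution-environment information.
from dataclasses import dataclass, field


@dataclass
class Dataset:
    name: str
    description: str = ""
    # Manifest and future metadata live with the data, not the environment.
    manifest: dict = field(default_factory=dict)


@dataclass
class Workspace:
    runner: str             # e.g. "local" or "docker" (assumed values)
    working_directory: str  # renamed from base_path per #887
```

With this split, the same Dataset definition can be executed against different Workspaces without modification.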

```diff
@@ -440,30 +440,62 @@ def get_nested_dict_hash(input_dict):
     return get_nested_dict_hash(component_op_uid_dict)


-class Pipeline:
-    """Class representing a Fondant Pipeline."""
+class Workspace:
```
@mrchtr (Contributor, Author) commented:

I would do the clean up of this class as part of #887.

```python
            raise InvalidPipelineDefinition(
                msg,
            )
        # TODO: add method call to retrieve workspace context, and make passing workspace optional
```
@mrchtr (Contributor, Author) commented:

Every operation method (read, apply, write) takes the workspace as an argument.
@GeorgesLorre I guess we can use the workspace initialised in the CLI and pass it to these methods somehow, maybe as part of the arguments? I would propose to clean this up as part of #887.

@GeorgesLorre (Collaborator) commented:

This is indeed not ideal. Ideally we inject these values at compile time. That way we can keep the datasets separate from the workspace.
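Compile-time injection as described could be sketched roughly like this; the Compiler class and its `compile` signature are hypothetical, not Fondant's actual API:

```python
# Hypothetical sketch of compile-time injection: the dataset graph is
# defined without any workspace, and the compiler binds the workspace
# values only when producing the execution spec.
from types import SimpleNamespace


class Compiler:
    def __init__(self, workspace):
        self.workspace = workspace

    def compile(self, dataset) -> dict:
        # Environment details are resolved here, at compile time,
        # so Dataset definitions stay environment-agnostic.
        return {
            "dataset": dataset.name,
            "working_directory": self.workspace.working_directory,
        }


ws = SimpleNamespace(working_directory="/data/run")
ds = SimpleNamespace(name="images")
spec = Compiler(ws).compile(ds)
# spec carries the environment; the dataset object never saw it
```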

```python
    # use workspace from cli command
    # if args.workspace exists

    workspace = getattr(args, "workspace", None)
```
@mrchtr (Contributor, Author) commented:

I've already added the workspace to the CLI. For now, the implementation uses a default workspace.
The run command loads the dataset object from the file (formerly the pipeline) and executes the operations.

As part of #887 I would handle proper default workspace initialisation, and we can think about referring to a workspace file (for instance .workspace/local.env containing information on the environment and runner) or a definition inside the source code.
At least something like `fondant run local ... --workspace ./local.env` feels natural.
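Loading such an env-style workspace file could be sketched as follows; the file format and the keys used are assumptions, not a defined Fondant format:

```python
# Hypothetical sketch: parse a simple KEY=VALUE workspace file
# (e.g. .workspace/local.env) into a dict, skipping blanks and comments.
def load_workspace(path: str) -> dict:
    workspace = {}
    with open(path) as f:
        for line in f:
            line = line.strip()
            if not line or line.startswith("#"):
                continue  # ignore blank lines and comments
            key, _, value = line.partition("=")
            workspace[key.strip()] = value.strip()
    return workspace
```

A CLI flag like `--workspace ./local.env` would then just call this loader and hand the result to the runner.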

@mrchtr mrchtr force-pushed the feature/remove-pipeline-interface branch from 0898170 to 3402571 Compare March 20, 2024 15:22
@mrchtr mrchtr merged commit f9238e2 into feature/refactore-pipeline-interface Mar 21, 2024
9 checks passed
@mrchtr mrchtr deleted the feature/remove-pipeline-interface branch March 21, 2024 10:43
GeorgesLorre pushed a commit that referenced this pull request Apr 4, 2024
First steps for the implementation of the new dataset interface:
- Removed the Pipeline class
- Added a Workspace singleton to hold the pipeline name, base_path, etc. (shouldn't be the focus of this PR)
- Moved `Pipeline.read(..)` to the Dataset class