
Implementation of new dataset interface #902

Merged

Conversation

@mrchtr (Contributor) commented Mar 11, 2024

First steps for the implementation of the new dataset interface:

  • Removed the Pipeline class
  • Added a Workspace singleton to hold the pipeline name, base_path, etc. (shouldn't be the focus of this PR)
  • Moved `Pipeline.read(..)` to the Dataset class

@mrchtr mrchtr requested review from RobbeSneyders and GeorgesLorre and removed request for RobbeSneyders March 11, 2024 15:21
@RobbeSneyders (Member) left a comment

Thanks @mrchtr!

Not sure if going with a singleton is the best approach here. It can make testing harder and can become a bit too magic. We might still want to inject it into the Dataset class instead.
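A minimal sketch of the injection approach being suggested, assuming hypothetical constructor signatures (the actual Workspace and Dataset fields in the PR may differ):

```python
# Hypothetical sketch: injecting a Workspace into Dataset instead of relying
# on a singleton. Field names (name, base_path) follow the PR description;
# the constructor shapes are assumptions.
from dataclasses import dataclass


@dataclass
class Workspace:
    name: str
    base_path: str


class Dataset:
    def __init__(self, name: str, *, workspace: Workspace):
        self.name = name
        # Explicit dependency: easy to replace with a throwaway
        # instance in tests, no hidden global state.
        self.workspace = workspace


# In a test, a fake workspace can simply be passed in:
ws = Workspace(name="test-pipeline", base_path="/tmp/fondant")
ds = Dataset("my-dataset", workspace=ws)
```

The trade-off against a singleton is verbosity: every call site carries the dependency, but tests can run in parallel with independent workspaces.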

src/fondant/cli.py (two review threads, outdated and resolved)
@GeorgesLorre (Collaborator) commented Mar 18, 2024

Starting to look good @mrchtr

I agree that the workspace concept might still need some further refinement, e.g. how we instantiate it and how we link it to an execution, but it's already serving us. We could refine it in a follow-up PR, or it might still be mergeable into the runners/compilers.

The Pipeline/Dataset changes are looking good, still some TODOs, but the MVP is here:

  • move the `.read()` method
  • add validation

I'm OK with merging this into the other branch and focusing on green pipelines. Then we can add the missing functionality.

@mrchtr (Contributor, Author) commented Mar 18, 2024


Sounds like a plan! Going to fix the tests and then we can divide the work from there on.

@mrchtr (Contributor, Author) left a comment

I've added some notes to discuss the open topics. Feel free to have a look @GeorgesLorre. :)

```python
            raise InvalidWorkspaceDefinition(msg)
        return name

    def get_run_id(self) -> str:
```
@mrchtr (Contributor, Author) commented:

Not sure if the caching is currently working. The run id is used to detect changes. The workspace should hold run ids in the future as well; I could imagine a combination of runner type and working directory.

@GeorgesLorre (Collaborator) commented:

I would not put more effort into it for this PR; we can test and fix separately.
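The run id idea floated above (runner type combined with working directory) could be sketched like this; the function name and hashing scheme are assumptions, not Fondant's actual implementation:

```python
# Hypothetical sketch: derive a deterministic run id from runner type and
# working directory, so repeated runs in the same environment get the same
# id and caching can detect unchanged executions.
import hashlib


def get_run_id(runner_type: str, working_directory: str) -> str:
    key = f"{runner_type}:{working_directory}".encode("utf-8")
    # Short, stable digest: same inputs always produce the same id.
    return hashlib.sha1(key).hexdigest()[:12]
```

Because the id is deterministic, a cache keyed on it survives process restarts; switching runner or working directory naturally invalidates it.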


```python
    def __init__(
        self,
        name: str,
        *,
        base_path: str,
```
@mrchtr (Contributor, Author) commented:

I would rename this to working_directory as part of #887 to make the difference between Dataset and Workspace clearer.
A Dataset holds all relevant information (e.g. name, description, manifest, potential metadata in the future).
A Workspace is used to determine the execution environment (e.g. runner information, working directories).
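The proposed split could look roughly like the following; all field names beyond those mentioned in the comment are assumptions:

```python
# Hypothetical sketch of the proposed responsibility split:
# Dataset holds data-related information, Workspace holds
# execution-environment information.
from dataclasses import dataclass, field


@dataclass
class Dataset:
    name: str
    description: str = ""
    # Manifest and future metadata live with the data, not the environment.
    manifest: dict = field(default_factory=dict)


@dataclass
class Workspace:
    runner: str             # e.g. "local" or "docker" (assumed values)
    working_directory: str  # renamed from base_path per #887
```

With this split, the same Dataset definition can be executed against different Workspaces without modification.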

```diff
@@ -440,30 +440,62 @@ def get_nested_dict_hash(input_dict):
     return get_nested_dict_hash(component_op_uid_dict)


-class Pipeline:
-    """Class representing a Fondant Pipeline."""
+class Workspace:
```
@mrchtr (Contributor, Author) commented:

I would do the clean up of this class as part of #887.

```python
            raise InvalidPipelineDefinition(
                msg,
            )
        # TODO: add method call to retrieve workspace context, and make passing workspace optional
```
@mrchtr (Contributor, Author) commented:

Every operation method (read, apply, write) takes the workspace as an argument.
@GeorgesLorre I guess we can use the workspace initialised in the CLI and pass it to these methods somehow, maybe as part of the arguments? I would propose to clean this up as part of #887.

@GeorgesLorre (Collaborator) commented:

This is indeed not ideal. Ideally we inject these values at compile time. That way we can keep the datasets separate from the workspace.
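Compile-time injection as described could be sketched roughly like this; the Compiler class and its `compile` signature are hypothetical, not Fondant's actual API:

```python
# Hypothetical sketch of compile-time injection: the dataset graph is
# defined without any workspace, and the compiler binds the workspace
# values only when producing the execution spec.
from types import SimpleNamespace


class Compiler:
    def __init__(self, workspace):
        self.workspace = workspace

    def compile(self, dataset) -> dict:
        # Environment details are resolved here, at compile time,
        # so Dataset definitions stay environment-agnostic.
        return {
            "dataset": dataset.name,
            "working_directory": self.workspace.working_directory,
        }


ws = SimpleNamespace(working_directory="/data/run")
ds = SimpleNamespace(name="images")
spec = Compiler(ws).compile(ds)
# spec carries the environment; the dataset object never saw it
```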

```python
    # use workspace from cli command
    # if args.workspace exists

    workspace = getattr(args, "workspace", None)
```
@mrchtr (Contributor, Author) commented:

I've already added the workspace to the CLI. For now, the implementation uses a default workspace.
The run command loads the dataset object from the file (formerly the pipeline) and executes the operations.

As part of #887 I would handle proper default workspace initialisation, and we can think about referring to a workspace file (for instance .workspace/local.env containing information on the environment and runner) or a definition inside the source code.
At least something like `fondant run local ... --workspace ./local.env` feels natural.
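Loading such an env-style workspace file could be sketched as follows; the file format and the keys used are assumptions, not a defined Fondant format:

```python
# Hypothetical sketch: parse a simple KEY=VALUE workspace file
# (e.g. .workspace/local.env) into a dict, skipping blanks and comments.
def load_workspace(path: str) -> dict:
    workspace = {}
    with open(path) as f:
        for line in f:
            line = line.strip()
            if not line or line.startswith("#"):
                continue  # ignore blank lines and comments
            key, _, value = line.partition("=")
            workspace[key.strip()] = value.strip()
    return workspace
```

A CLI flag like `--workspace ./local.env` would then just call this loader and hand the result to the runner.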

@mrchtr mrchtr force-pushed the feature/remove-pipeline-interface branch from 0898170 to 3402571 Compare March 20, 2024 15:22
@mrchtr mrchtr merged commit f9238e2 into feature/refactore-pipeline-interface Mar 21, 2024
9 checks passed
@mrchtr mrchtr deleted the feature/remove-pipeline-interface branch March 21, 2024 10:43
GeorgesLorre pushed a commit that referenced this pull request Apr 4, 2024
First steps for the implementation of the new dataset interface:
- Removed the Pipeline class
- Added a Workspace singleton to hold the pipeline name, base_path, etc. (shouldn't be the focus of this PR)
- Moved `Pipeline.read(..)` to the Dataset class