Implementation new dataset interface #902
Conversation
Thanks @mrchtr!
Not sure if going with a singleton is the best approach here. It can make testing harder and can become a bit too magic. We might still want to inject it into the `Dataset` class instead.
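To make the trade-off concrete, here is a minimal sketch of the injection alternative; the class and field names are illustrative, not Fondant's actual API:

```python
from dataclasses import dataclass


@dataclass
class Workspace:
    """Illustrative execution context (not Fondant's real class)."""

    name: str
    base_path: str


class Dataset:
    def __init__(self, workspace: Workspace) -> None:
        # Injected dependency: a test can pass in a fake workspace directly,
        # whereas a module-level singleton would need to be patched globally.
        self.workspace = workspace


# In a test, no global state needs patching:
ds = Dataset(Workspace(name="test", base_path="/tmp/fondant"))
```

With a singleton, every test touching `Dataset` shares the same hidden state; with injection, each test constructs exactly the context it needs.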
Starting to look good @mrchtr! I agree that the workspace concept might still need some further refinement, i.e. how we instantiate it and how we link it to an execution, but it is already serving us. We could refine it in a follow-up PR, or it might still be mergeable in the runners/compilers. The pipeline/Dataset changes are looking good; still some todo's, but the MVP is here.
I'm ok with merging this in the other branch and focusing on green pipelines. Then we can add the missing functionality.
Sounds like a plan! Going to fix the tests, and then we can divide the work from there on.
I've added some notes to discuss the open topics. Feel free to have a look @GeorgesLorre. :)
    raise InvalidWorkspaceDefinition(msg)
return name

def get_run_id(self) -> str:
Not sure if the caching is currently working.
The run id is used to determine changes. The workspace should hold run ids in the future as well. I could imagine a combination of runner type and working directory.
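For discussion, a hypothetical sketch of what deriving the run id from runner type and working directory could look like; the function signature and scheme are assumptions, not Fondant's current behaviour:

```python
import hashlib


def get_run_id(runner: str, working_directory: str, pipeline_name: str) -> str:
    # Hypothetical: hash the runner type together with the working directory
    # so the same (runner, directory) pair always yields the same id,
    # which is what cache invalidation would key on.
    digest = hashlib.sha256(f"{runner}:{working_directory}".encode()).hexdigest()[:8]
    return f"{pipeline_name}-{digest}"
```

The same runner and directory produce a stable id, while switching runners (e.g. local vs. kubeflow) produces a different one, so cached results would not leak across environments.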
I would not put more effort into it for this PR. We can test and fix separately.
def __init__(
    self,
    name: str,
    *,
    base_path: str,
I would rename this into `working_directory` as part of #887 to make the difference between `Dataset` and `Workspace` more clear.
A `Dataset` holds all relevant information (e.g. name, description, manifest, potential metadata in the future).
A `Workspace` is used to determine the execution environment (e.g. runner information, working directories).
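The split described above could be sketched roughly like this; the field names are illustrative assumptions, not the final #887 design:

```python
from dataclasses import dataclass, field


@dataclass
class Dataset:
    # Data-centric state only: what the data is, not where/how it runs.
    # Field names are illustrative, not Fondant's actual API.
    name: str
    description: str = ""
    manifest: dict = field(default_factory=dict)


@dataclass
class Workspace:
    # Execution-environment state only: how and where operations execute.
    runner: str
    working_directory: str
```

Keeping the two dataclasses free of references to each other is what makes it possible to bind them together later (e.g. at compile time) instead of at definition time.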
@@ -440,30 +440,62 @@ def get_nested_dict_hash(input_dict):
     return get_nested_dict_hash(component_op_uid_dict)


-class Pipeline:
-    """Class representing a Fondant Pipeline."""
+class Workspace:
I would do the cleanup of this class as part of #887.
raise InvalidPipelineDefinition(
    msg,
)
# TODO: add method call to retrieve workspace context, and make passing workspace optional
Every operation method (read, apply, write) takes the workspace as an argument.
@GeorgesLorre I guess we can use the workspace initialised in the CLI to pass it somehow to these methods. Maybe as part of the arguments? I would propose to clean this up as part of #887.
This is indeed not ideal. Ideally we inject these values at compile time. That way we can keep the datasets separate from the workspace.
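A rough sketch of compile-time injection, assuming a compile step that takes both objects; the function and spec keys are hypothetical, not Fondant's compiler API:

```python
from types import SimpleNamespace


def compile_spec(dataset, workspace) -> dict:
    # Hypothetical compile step: the dataset definition stays
    # workspace-agnostic, and the compiler binds the workspace values
    # into the execution spec only at this point.
    return {
        "dataset": dataset.name,
        "runner": workspace.runner,
        "working_directory": workspace.working_directory,
    }


# Stand-ins for the real objects, just for illustration:
dataset = SimpleNamespace(name="my-dataset")
workspace = SimpleNamespace(runner="local", working_directory="/tmp/work")
spec = compile_spec(dataset, workspace)
```

The same dataset object can then be compiled against different workspaces (local, kubeflow, ...) without its definition ever knowing about them.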
# use workspace from cli command
# if args.workspace exists
workspace = getattr(args, "workspace", None)
I've already added the workspace to the CLI. For now, the implementation is using a default workspace.
The run command loads the dataset object of the file (former pipeline) and executes the operations.
As part of #887 I would handle a proper default workspace initialisation, and we can think about referring to a workspace file (for instance `.workspace/local.env` containing information on environment and runner) or a definition inside the source code.
At least something like `fondant run local ... --workspace ./local.env` feels natural.
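Parsing such a workspace file could be as simple as the sketch below; the `KEY=VALUE` format and key names are assumptions for discussion, not an existing Fondant feature:

```python
def parse_workspace_env(text: str) -> dict:
    # Hypothetical parser for a `./local.env`-style workspace file with
    # KEY=VALUE lines and `#` comments, e.g.:
    #   RUNNER=local
    #   WORKING_DIRECTORY=/tmp/fondant
    config = {}
    for line in text.splitlines():
        line = line.strip()
        if line and not line.startswith("#"):
            key, _, value = line.partition("=")
            config[key.strip()] = value.strip()
    return config
```

A `--workspace ./local.env` flag would then just read the file and feed the resulting dict into the default workspace initialisation.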
First steps for the implementation of the new dataset interface:
- Removed `Pipeline` class
- Added `Workspace` singleton to hold pipeline name, `base_path`, etc. (shouldn't be the focus of this PR)
- Moved `Pipeline.read(..)` to `Dataset` class