Refactor Pipeline interface #886

Closed
Tracked by #853
mrchtr opened this issue Feb 29, 2024 · 2 comments
@mrchtr
Contributor

mrchtr commented Feb 29, 2024

No description provided.

@GeorgesLorre GeorgesLorre changed the title from "Combine pipeline and dataset class" to "Refactor Pipeline interface" Mar 8, 2024
@GeorgesLorre
Collaborator

GeorgesLorre commented Mar 8, 2024

With Fondant we want to become dataset-focused. This means the current way of defining a Fondant workflow could use some changes to promote this.

This is the current way:

# Define pipeline
pipeline = Pipeline(
    name="foo", description="bar", base_path="baz"
)

# Register components
data1 = pipeline.read(a_component)
data2 = data1.apply(a_different_component)

Ideally we want to go to something like this:

# Register components
data1 = Dataset.read("some_ref_to_a_manifest")
data2 = data1.apply(a_different_component)

Here you get new datasets by applying operations to existing datasets.
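To make the idea concrete, here is a minimal sketch of a dataset that returns a new dataset from every operation and remembers where it came from. The class below is a toy stand-in, not Fondant's actual `Dataset`; the `operation`, `parent`, and `lineage` names are assumptions made for illustration, and components are represented as plain strings.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Dataset:
    """Toy stand-in for illustration, not Fondant's actual Dataset class."""

    operation: str                    # operation that produced this dataset
    parent: Optional[Dataset] = None  # dataset the operation was applied to

    @classmethod
    def read(cls, manifest_ref: str) -> Dataset:
        # Entrypoint: start a lineage from a reference to a manifest.
        return cls(operation=f"read:{manifest_ref}")

    def apply(self, component: str) -> Dataset:
        # Applying an operation does not mutate; it returns a new dataset.
        return Dataset(operation=component, parent=self)

    def lineage(self) -> list[str]:
        # Walk back through the parents, then reverse to get execution order.
        steps = []
        node: Optional[Dataset] = self
        while node is not None:
            steps.append(node.operation)
            node = node.parent
        return steps[::-1]


data1 = Dataset.read("some_ref_to_a_manifest")
data2 = data1.apply("a_different_component")
print(data2.lineage())
```

Because each dataset only points at its parent, there is no central pipeline object: the graph is implied by the chain of datasets.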

The current Pipeline class has a couple of responsibilities:

  • Register the base_path
  • Register the pipeline_name
  • Register the pipeline_description
  • Entrypoint for the first operation
  • Hold the graph of operations
  • Sort and validate graph
  • Do validation on pipeline

We need to redistribute these responsibilities if we want to remove the pipeline interface.

Move to the Compiler/Runner:

  • The pipeline name and description. For the name, this already happens for SageMaker, since we can't store the pipeline name in the compiled pipeline spec.
  • The base_path could be moved here. The base path and the runner are often related: GCS for Vertex, S3 for SageMaker, etc. We may want to revisit this in the future, since the base_path might become more important if we focus on sharing intermediate datasets. The concept of a workspace (Add workspace context #887) might solve this, or a default base_path with an optional per-dataset override.
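The "default base_path with a per-dataset override" option could look roughly like the sketch below. `RunnerSettings` and `base_path_for` are hypothetical names, not part of Fondant, and the bucket paths are placeholders.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class RunnerSettings:
    """Hypothetical settings a compiler/runner could own instead of the Pipeline."""

    pipeline_name: str
    base_path: str  # runner-level default, e.g. a bucket in the runner's cloud
    pipeline_description: str = ""

    def base_path_for(self, dataset_override: Optional[str] = None) -> str:
        # A dataset-specific base_path wins; otherwise use the runner default.
        return dataset_override if dataset_override is not None else self.base_path


settings = RunnerSettings(pipeline_name="foo", base_path="gs://default-bucket")
print(settings.base_path_for())
print(settings.base_path_for("s3://shared-datasets"))
```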

Move to the Dataset:

  • The entrypoint. We can start by initialising a Dataset instance, which can take multiple forms. Related to Initiliase dataset from previous Fondant run #885
  • All graph and validation logic. This is still something to experiment with: we won't have a central place to manage the graph anymore, which brings more flexibility but also extra complexity.

The different compilers/runners will then need to work with a dataset as input. We will also need logic there to build the correct graph of operations and translate it into the runner-specific pipeline spec.
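As a rough sketch of what that compiler-side logic could involve: starting from the final dataset, walk back through its parents to rebuild the graph, sort it topologically, and emit a spec. Everything here is an assumption for illustration, including the toy `Dataset`, the `build_graph` and `compile_spec` helpers, and the shape of the spec dictionary; none of it is Fondant's actual compiler. The sorting uses `graphlib.TopologicalSorter` from the standard library (Python 3.9+).

```python
from __future__ import annotations

from dataclasses import dataclass
from graphlib import TopologicalSorter
from typing import Optional


@dataclass(frozen=True)
class Dataset:
    """Toy stand-in: a dataset that remembers which operation produced it."""

    operation: str
    parent: Optional[Dataset] = None


def build_graph(dataset: Dataset) -> dict[str, set[str]]:
    # Map every operation to the operations it depends on.
    graph: dict[str, set[str]] = {}
    node: Optional[Dataset] = dataset
    while node is not None:
        parent = node.parent
        graph[node.operation] = {parent.operation} if parent is not None else set()
        node = parent
    return graph


def compile_spec(dataset: Dataset, pipeline_name: str, base_path: str) -> dict:
    # The name and base_path are supplied at compile time, not by the dataset.
    graph = build_graph(dataset)
    # static_order() yields dependencies first and raises CycleError on cycles.
    order = list(TopologicalSorter(graph).static_order())
    return {
        "name": pipeline_name,
        "base_path": base_path,
        "steps": [
            {"operation": op, "depends_on": sorted(graph[op])} for op in order
        ],
    }


data1 = Dataset("read")
data2 = Dataset("a_different_component", parent=data1)
spec = compile_spec(data2, pipeline_name="foo", base_path="baz")
print([step["operation"] for step in spec["steps"]])
```

A real runner would translate the ordered steps into its own format; the plain dictionary only stands in for that runner-specific spec.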

@mrchtr mrchtr self-assigned this Mar 11, 2024
@mrchtr mrchtr moved this from Backlog to In Progress in Fondant development Mar 11, 2024
@mrchtr mrchtr added the Core Core framework label Mar 11, 2024
@mrchtr
Contributor Author

mrchtr commented Mar 19, 2024

As discussed offline with @GeorgesLorre, the first step is to merge the Pipeline and Dataset classes.
We will incorporate a temporary Workspace to guarantee the core code keeps working.
This will be tackled in #902.

@mrchtr mrchtr moved this from In Progress to Validation in Fondant development Mar 26, 2024
@mrchtr mrchtr moved this from Validation to Done in Fondant development Mar 27, 2024
@mrchtr mrchtr closed this as completed Apr 5, 2024