Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Don't use from_registry for generic components #285

Merged
merged 4 commits into from
Jul 18, 2023

Conversation

RobbeSneyders
Copy link
Member

Fixes #251, #252, #253

Before this PR, the operation for a generic component (a reusable component that dynamically takes a component specification), needed to be created as follows:

component_op = ComponentOp.from_registry(
    name="generic_component",
    component_spec_path="components/custom_generic_component/fondant_component.yaml",
)

But the provided name wasn't actually used, since the component specification already contains a reference to the reusable image that it should use. Now we can define both custom and generic components as follows:

component_op = ComponentOp(
    component_dir="components/custom_generic_component",
)

There is still a difference in how we want to handle custom and generic components though. Custom components should be built by the local runner, while generic components should not, since they use a reusable image. To make this differentiation, we now simply check if a Dockerfile is present in the provided Component_dir. This will be the case for a custom component, but not for a generic component.


This has a nice implicit result for reusable components as well, which can still be defined as:

component_op = ComponentOp.from_registry(
    name="reusable_component",
)

Which fondant now resolves to:

component_op = ComponentOp(
    component_dir="{fondant_install_dir}/components/custom_generic_component",
)

I added a change to this PR which no longer packages the reusable component code with the fondant package, but only the component specifications, as those are all that is needed since they contain a reference to the reusable image on the registry. This means that the component_dir above doesn't contain a Dockerfile when fondant is installed from PyPI, but does when you locally install fondant using poetry install. So the local runner doesn't build reusable components when users install fondant from PyPI, but it does when you're working on the fondant repo, which is useful for us fondant developers.


The only thing we still need is an option on the runner to provide build_args, so we can pass in the fondant version to build into the image. But I'll open a separate PR for that.

@@ -9,5 +9,5 @@ root_path=$(dirname "$scripts_path")

pushd "$root_path"
rm -rf src/fondant/components
cp -r components/ src/fondant/
find components/ -type f | grep -i yaml$ | xargs -i cp --parents {} src/fondant/
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Only copy the component specifications and keep the same structure.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you validate this on mac @GeorgesLorre?

"build"
] = f"./{Path(component_op.component_spec_path).parent}"
if component_op.dockerfile_path is not None:
services[safe_component_name]["build"] = str(component_op.component_dir)
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The previous implementation failed for absolute paths, this works for both absolute and relative paths.

assert (
load_from_hub_custom_op.component_spec
== load_from_hub_default_op.component_spec
)
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some of these checks were trivial but harder to replicate, so only kept the one that is non-trivial.

self.component_spec.specification,
)
def dockerfile_path(self) -> t.Optional[Path]:
path = self.component_dir / "Dockerfile"
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Dockerfile is the default and should be used but I have seen people use dockerfile before (and it also works) so maybe we should also check for that ?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think we have to, since we can expect these components to be implemented for fondant specifically. So we can impose our own expectations. I don't think that's a bad thing to do here, since we adhere to the standard and keep things simple.

@GeorgesLorre
Copy link
Collaborator

Did you check the docs for needed changes ? I think at least the getting-started might need some updates aswel.

Copy link
Contributor

@PhilippeMoussalli PhilippeMoussalli left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks Robbe!

I just have a few doubts about whether this can be clear for the end user. Right now we have:

  1. Custom components
ComponentOp(component_dir="")
  1. Resuable components
ComponentOp.from_registry(name="")
  1. Generic components

Can be considered to be a hybrid of custom component (require a customized component spec) and a reusable component (images are already built and available in the registry).

However, right now it seems like we're defining them as if they're fully custom

ComponentOp(component_dir="")

At first glance, it seems like the directory should contain all the component files (src_code, Dockerfile, ...) similar to the custom components where they are used to build the images for the local runner. But in fact, we're just looking for the spec file within the specified directory.

Wondering whether it wouldn't better to have generic components defined more explicitly by just passing in the component spec path since the users will anyway have to create a custom component spec for that component:

ComponentOp.from_spec(component_spec_path="")

I think it would be nice to have separate constructors for each component type, Wdyt?

@RobbeSneyders
Copy link
Member Author

Did you check the docs for needed changes ? I think at least the getting-started might need some updates aswel.

Not yet, I'll have a look.


I think it would be nice to have separate constructors for each component type, Wdyt?

I partially agree. I agree that we need to make sure it is clear to the user, but on the other hand, there's not really that big of a difference between a custom and a generic component. Both only need a component specification referring to an existing image to work.

You could for instance define a custom component as follows and it would work as well:

|- my_image
|  |- src
|  |  |- main.py
|  |- Dockerfile
|- components
|  |- custom_component
|     |- fondant_component.yaml
|        "image: my_image:tag"
|- pipeline.py
   "component_op = ComponentOp.from_spec(path="components/custom_component/fondant_component.yaml")

So while splitting the constructors kind of makes sense logically, it actually creates two ways to achieve the same thing. You could use both constructors to create either a custom or a generic component.

component_op(component_dir="components/custom_component")
component_op.from_spec(component_spec_path="components/custom_component/fondant_component.yaml")

The only functionality for which we need more than the component specification, is to automatically build the images, which we currently only do in the local runner.
(I do think it would be nice to extend this behavior to other runners as well in the future though, and possibly even provide a fondant build command to help users build their components.)

That's the only reason why it might make sense to split the constructors, but this doesn't really matter for your pipeline definition, only execution. So I would move this responsibility to the runner and let it decide when to build (based on presence of a Dockerfile, command flags, ...).

So I would select only one of the two interfaces, and then I see the following (dis)advantages:

ComponentOp(component_dir="components/custom_component") ComponentOp.from_spec("components/custom_component/fondant_component.yaml")
Less clear that users can use a generic component by only defining a fondant_component.yaml More clear that only the fondant_component.yaml is needed to create a custom or generic component
No need to add fondant_component.yaml since the name is fixed Defining fondant_component.yaml each time seems superfluous, so then we might not want to keep it fixed

I don't have a strong preference, but since we have to choose, I would go with:

componentOp(component_dir="components/custom_component")

and clearly document the expected structure of the component_dir:

|- fondant_component.yaml (required)
|- Dockerfile (optional)

(All the other files are actually don't matter to fondant at the moment, but are only used in the Dockerfile)

Happy to hear other opinions.

@RobbeSneyders RobbeSneyders force-pushed the feature/component-op-dir branch from 966cc5a to 2f93a47 Compare July 17, 2023 23:03
@GeorgesLorre
Copy link
Collaborator

Thanks for the detailed explanation and options:

I suggest we go for (also not a strong preference)
componentOp(component_dir="components/custom_component")

The key will be documenting it very well and maybe expanding the logging with
Found Dockerfile for component x: building image

Copy link
Contributor

@PhilippeMoussalli PhilippeMoussalli left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks Robbe! Looks good overall and the documentation is clear.

Just a few minor things to adjust in the documentation

)
```

??? "fondant.pipeline.ComponentOp.from_registry"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

are those meant to end up here? there are quite a few of them

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, you can see the result in the built documentation:
https://fondant--285.org.readthedocs.build/en/285/components/

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

looks neat ✨

docker image. The specification defines the docker image fondant should run to execute the
component, which data it consumes and produces, and which arguments it takes.

## Component types
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like the descriptions, quite clear

from fondant.pipeline import ComponentOp

component_op = ComponentOp(
component_dir="components/custom_component",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

custom_component-> generic_component

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch

@RobbeSneyders RobbeSneyders merged commit a0fa618 into main Jul 18, 2023
@RobbeSneyders RobbeSneyders deleted the feature/component-op-dir branch July 18, 2023 08:59
satishjasthi pushed a commit to satishjasthi/fondant that referenced this pull request Jul 21, 2023
Fixes ml6team#251, ml6team#252, ml6team#253

Before this PR, the operation for a generic component (a reusable
component that dynamically takes a component specification), needed to
be created as follows:

```python
component_op = ComponentOp.from_registry(
    name="generic_component",
    component_spec_path="components/custom_generic_component/fondant_component.yaml",
)
```

But the provided name wasn't actually used, since the component
specification already contains a reference to the reusable image that it
should use. Now we can define both custom and generic components as
follows:

```python
component_op = ComponentOp(
    component_dir="components/custom_generic_component",
)
```

There is still a difference in how we want to handle custom and generic
components though. Custom components should be built by the local
runner, while generic components should not, since they use a reusable
image. To make this differentiation, we now simply check if a
`Dockerfile` is present in the provided `Component_dir`. This will be
the case for a custom component, but not for a generic component.

---

This has a nice implicit result for reusable components as well, which
can still be defined as:

```python
component_op = ComponentOp.from_registry(
    name="reusable_component",
)
```

Which `fondant` now resolves to:
```python
component_op = ComponentOp(
    component_dir="{fondant_install_dir}/components/custom_generic_component",
)
```

I added a change to this PR which no longer packages the reusable
component code with the `fondant` package, but only the component
specifications, as those are all that is needed since they contain a
reference to the reusable image on the registry. This means that the
`component_dir` above doesn't contain a `Dockerfile` when `fondant` is
installed from PyPI, but does when you locally install `fondant` using
`poetry install`. So the local runner doesn't build reusable components
when users install `fondant` from PyPI, but it does when you're working
on the `fondant` repo, which is useful for us `fondant` developers.

---

The only thing we still need is an option on the runner to provide
`build_args`, so we can pass in the `fondant` version to build into the
image. But I'll open a separate PR for that.
Hakimovich99 pushed a commit that referenced this pull request Oct 16, 2023
Fixes #251, #252, #253

Before this PR, the operation for a generic component (a reusable
component that dynamically takes a component specification), needed to
be created as follows:

```python
component_op = ComponentOp.from_registry(
    name="generic_component",
    component_spec_path="components/custom_generic_component/fondant_component.yaml",
)
```

But the provided name wasn't actually used, since the component
specification already contains a reference to the reusable image that it
should use. Now we can define both custom and generic components as
follows:

```python
component_op = ComponentOp(
    component_dir="components/custom_generic_component",
)
```

There is still a difference in how we want to handle custom and generic
components though. Custom components should be built by the local
runner, while generic components should not, since they use a reusable
image. To make this differentiation, we now simply check if a
`Dockerfile` is present in the provided `Component_dir`. This will be
the case for a custom component, but not for a generic component.

---

This has a nice implicit result for reusable components as well, which
can still be defined as:

```python
component_op = ComponentOp.from_registry(
    name="reusable_component",
)
```

Which `fondant` now resolves to:
```python
component_op = ComponentOp(
    component_dir="{fondant_install_dir}/components/custom_generic_component",
)
```

I added a change to this PR which no longer packages the reusable
component code with the `fondant` package, but only the component
specifications, as those are all that is needed since they contain a
reference to the reusable image on the registry. This means that the
`component_dir` above doesn't contain a `Dockerfile` when `fondant` is
installed from PyPI, but does when you locally install `fondant` using
`poetry install`. So the local runner doesn't build reusable components
when users install `fondant` from PyPI, but it does when you're working
on the `fondant` repo, which is useful for us `fondant` developers.

---

The only thing we still need is an option on the runner to provide
`build_args`, so we can pass in the `fondant` version to build into the
image. But I'll open a separate PR for that.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Don't use from_registry for generic components
3 participants