Add getting started documentation #250

Merged on Jul 3, 2023 (5 commits)

Conversation

GeorgesLorre
Collaborator

No description provided.

)

my_pipeline.add_op(load_from_hf_hub, dependencies=[])
```
Collaborator Author

Not sure if it is the best example since it seems quite complex

Member

We might just have to explain it a bit better, as it is the most useful component to easily load some data to get started. Not sure if we want to describe this as a third kind of component, which is a "generic" component.

docs/getting_started.md Outdated
Member

RobbeSneyders left a comment

Thanks @GeorgesLorre! Don't forget to add the file to the documentation index.


## Running your pipeline

A Fondant pipeline needs to be compiled before it can be run. This means translating the user-friendly Fondant pipeline definition into something that can be executed by a runner.
Member

I would leave out the compilation step here and only explain the combined functionality you get when using `fondant run`.

pyproject.toml Outdated
Comment on lines 49 to 55
+kfp = { version = ">= 1.8.19"}

-kfp = { version = ">= 1.8.19", optional = true }
 kubernetes = { version = ">= 18.20.0", optional = true }
 pandas = { version = ">= 1.3.5", optional = true }

 [tool.poetry.extras]
-pipelines = ["kfp", "kubernetes"]
+pipelines = ["kubernetes"]
Member

What's the reason for these changes? kfp is quite a big dependency which you don't need when installing fondant in a component, which is why we added it as an optional dependency to the pipeline extra.

Collaborator Author

`kfp` is used by `pipeline.py`, so it is needed for `fondant compile` and `fondant run`.

Member

Yes, so then the user needs to install fondant[pipelines] (fondant[kfp] might be better). But we don't want to install this in every component.
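
One common way to keep `kfp` out of component images while still giving pipeline users a clear error is to check for the optional module at the point of use. The sketch below is illustrative only: `require_extra` is a hypothetical helper that is not part of Fondant, and the extra name `pipelines` is taken from the `pyproject.toml` snippet above.

```python
import importlib.util


def require_extra(module_name: str, extra: str, package: str = "fondant") -> None:
    """Raise a helpful ImportError if an optional top-level module is missing.

    Hypothetical helper for illustration; not part of Fondant.
    """
    # find_spec returns None when a top-level module cannot be found.
    if importlib.util.find_spec(module_name) is None:
        raise ImportError(
            f"`{module_name}` is required for this command. "
            f"Install it with: pip install '{package}[{extra}]'"
        )


# Stand-in check with a standard-library module, so this line passes anywhere.
# In the scenario discussed here, the call would be require_extra("kfp", "pipelines").
require_extra("json", extra="pipelines")
```

With a guard like this, only the code paths that compile or run a pipeline would demand the extra, and components could install the base package without it.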


Note that if you use a local `base_path` in your pipeline declaration, this path will be mounted in the Docker containers. This means that the data will be stored locally on your machine. If you use a cloud storage path, the data will be stored in the cloud.

Now that we have compiled our pipeline, we can run it:
Contributor

Maybe we should mention that the Docker images are stored in the GitHub container registry. I hadn't pulled images from there before, which led me to an additional `docker login` step.
Basically following the steps here: https://docs.github.com/en/packages/working-with-a-github-packages-registry/working-with-the-container-registry#authenticating-to-the-container-registry

Collaborator Author

Do we need to log in when the images are public? Or only for the private ones?

GeorgesLorre changed the title from "Add first part of getting started documentation" to "Add getting started documentation" on Jul 3, 2023
Member

RobbeSneyders left a comment

Thanks @GeorgesLorre, some minor comments, but otherwise looks good!

README.md Outdated
@@ -85,6 +85,11 @@ Eg. generating logos:

<p align="right">(<a href="#chocolate_bar-fondant">back to top</a>)</p>

## 💨 Getting Started
Member

I would move this before the example pipelines

dataframe[[("images", "data")]].map(extract_dimensions)
dataframe[[("images", "width"), ("images", "height")]] = dataframe[
    [("images", "data")]
].apply(lambda x: extract_dimensions(x.iloc[0]), axis=1)
Member

Can we use the column name here instead of the index? That would make it easier to understand.
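
A rough sketch of what name-based access could look like, assuming the dataframe is a pandas DataFrame with two-level column names as in the snippet above. The `extract_dimensions` body and the sample bytes are hypothetical stand-ins so the example runs on its own; the real function in the PR is not shown in this thread.

```python
from typing import Tuple

import pandas as pd


def extract_dimensions(image_bytes: bytes) -> Tuple[int, int]:
    """Hypothetical stand-in: treats the first two bytes as width and height."""
    return image_bytes[0], image_bytes[1]


# Toy dataframe with hierarchical (subset, field) column names.
columns = pd.MultiIndex.from_tuples([("images", "data")])
dataframe = pd.DataFrame([[bytes([4, 3])], [bytes([16, 9])]], columns=columns)

# Select the column by its name instead of using x.iloc[0] on each row.
dimensions = dataframe[("images", "data")].map(extract_dimensions)

# Write each result into its own named column.
dataframe[("images", "width")] = dimensions.map(lambda wh: wh[0])
dataframe[("images", "height")] = dimensions.map(lambda wh: wh[1])
```

Selecting `("images", "data")` directly returns a Series, so the function receives the image bytes without any positional lookup.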


my_pipeline = Pipeline(
    pipeline_name='my_pipeline',
    base_path='/home/username/my_pipeline', <--- Make sure to update this
Member

Suggested change
base_path='/home/username/my_pipeline', <--- Make sure to update this
base_path='/home/username/my_pipeline', # TODO: update this


Now that we have a pipeline, we can add components to it. Components are the building blocks of your pipeline. They are the individual steps that will be executed in your pipeline. There are 2 main types of components:

- reusable components: These are components that are already created by the community and can be easily used in your pipeline. You can find a list of reusable components [here](https://github.com/ml6team/fondant/tree/main/components). They often have arguments that you can set to configure them for your use case.
Member

Suggested change
- reusable components: These are components that are already created by the community and can be easily used in your pipeline. You can find a list of reusable components [here](https://github.com/ml6team/fondant/tree/main/components). They often have arguments that you can set to configure them for your use case.
- **reusable components**: These are components that are already created by the community and can be easily used in your pipeline. You can find a list of reusable components [here](https://github.com/ml6team/fondant/tree/main/components). They often have arguments that you can set to configure them for your use case.


- custom components: These are the components you create to solve your use case. A custom component can be easily created by adding a `fondant_component.yaml`, `dockerfile` and `main.py` file to your component subdirectory. The `fondant_component.yaml` file contains the specification of your component. You can find more information about it [here](https://github.com/ml6team/fondant/blob/main/docs/component_spec.md)
Member

Suggested change
- custom components: These are the components you create to solve your use case. A custom component can be easily created by adding a `fondant_component.yaml`, `dockerfile` and `main.py` file to your component subdirectory. The `fondant_component.yaml` file contains the specification of your component. You can find more information about it [here](https://github.com/ml6team/fondant/blob/main/docs/component_spec.md)
- **custom components**: These are the components you create to solve your use case. A custom component can be easily created by adding a `fondant_component.yaml`, `dockerfile` and `main.py` file to your component subdirectory. The `fondant_component.yaml` file contains the specification of your component. You can find more information about it [here](https://github.com/ml6team/fondant/blob/main/docs/component_spec.md)
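
To make the "building blocks" idea concrete, here is a toy model of how steps added with `dependencies` could be ordered for execution. `ToyPipeline` is a stand-in written for this note and is not Fondant's `Pipeline` class; the step name `image_filter` is hypothetical, and the ordering uses the standard-library `graphlib` module (Python 3.9+) rather than anything Fondant-specific.

```python
from graphlib import TopologicalSorter
from typing import Dict, List, Set


class ToyPipeline:
    """Illustrative stand-in, NOT Fondant's Pipeline class."""

    def __init__(self, pipeline_name: str, base_path: str) -> None:
        self.pipeline_name = pipeline_name
        self.base_path = base_path
        # Maps each step name to the set of step names it depends on.
        self._graph: Dict[str, Set[str]] = {}

    def add_op(self, op_name: str, dependencies: List[str]) -> None:
        """Register a step; every dependency must have been added before."""
        unknown = [dep for dep in dependencies if dep not in self._graph]
        if unknown:
            raise ValueError(f"Unknown dependencies for {op_name!r}: {unknown}")
        self._graph[op_name] = set(dependencies)

    def execution_order(self) -> List[str]:
        """Return the steps so that each one comes after its dependencies."""
        return list(TopologicalSorter(self._graph).static_order())


my_pipeline = ToyPipeline("my_pipeline", base_path="/home/username/my_pipeline")
my_pipeline.add_op("load_from_hf_hub", dependencies=[])
my_pipeline.add_op("image_filter", dependencies=["load_from_hf_hub"])
order = my_pipeline.execution_order()
```

In this model a loading step has no dependencies, and every later step names the steps whose output it consumes.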

logger.info("Filtering dataset...")

dataframe[[("images", "width"), ("images", "height")]] = \
dataframe[[("images", "data")]].apply(lambda x:extract_dimensions(x.iloc[0]), axis=1)
Member

Same comment as above.

Member

RobbeSneyders left a comment

LGTM!

GeorgesLorre merged commit 2009870 into main on Jul 3, 2023
GeorgesLorre deleted the feature/getting-started-docs branch on July 3, 2023 19:32
RobbeSneyders linked an issue on Jul 4, 2023 that may be closed by this pull request
Hakimovich99 pushed a commit that referenced this pull request Oct 16, 2023
Development

Successfully merging this pull request may close these issues.

Create 'Getting Started' documentation
4 participants