Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Alpha pipeline] Add image retrieval component #110

Merged
merged 15 commits into from
May 11, 2023
Merged

Conversation

NielsRogge
Copy link
Contributor

@NielsRogge NielsRogge commented May 10, 2023

This PR adds the image retrieval component, which retrieves image URLs using clip-retrieval from LAION-5B.

Fixes #95

Comment on lines +117 to +126
# TODO remove, just use a tiny df for testing purposes
data = {
"prompts_text": [
"comfortable bathroom, art deco interior design",
"comfortable bathroom, bauhaus interior design",
]
}
pandas_df = pd.DataFrame.from_dict(data)
df = dd.from_pandas(pandas_df, npartitions=1)
# end of TODO
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would keep this as long as we're building the pipeline. Would remove once the pipeline is finished

@@ -1,2 +1,2 @@
git+https://github.com/ml6team/fondant.git
git+https://github.com/ml6team/fondant.git@2abeabe07412266b78e3b2a4055e0b10d62168cb
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm guilty of this myself but don't forget to remove this

@NielsRogge NielsRogge requested a review from j-branigan May 10, 2023 14:46
Copy link
Contributor

@PhilippeMoussalli PhilippeMoussalli left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks Niels :) Looks good! I left a few comments.
How did you manage to resolve the issues with slow retrieval eventually?

@@ -140,7 +141,9 @@ def write_index(self, df: dd.DataFrame):
)

# Write index
dd.compute(upload_index_task)
with ProgressBar():
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice, something similar to tqdm with an ETA? does it show up nicely in the kfp logging?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It looks like this:

Screenshot 2023-05-10 at 17 38 17

It's not ideal as it seems to replicate the progress bars across cores? Not sure. Do we know how many cores the machines on Kubeflow have?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will depend on the machine defined for the component, but often quite a few I think. And also multiple machines in the future.

@@ -0,0 +1,2 @@
git+https://github.com/ml6team/fondant.git@2abeabe07412266b78e3b2a4055e0b10d62168cb
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same here

Copy link
Member

@RobbeSneyders RobbeSneyders left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @NielsRogge!

@@ -140,7 +141,9 @@ def write_index(self, df: dd.DataFrame):
)

# Write index
dd.compute(upload_index_task)
with ProgressBar():
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will depend on the machine defined for the component, but often quite a few I think. And also multiple machines in the future.

@RobbeSneyders
Copy link
Member

As discussed before, I would make these reusable components available in a components directory directly under the root directory. We can merge this one as is though, as we will have to provide a mechanism to easily load the reusable components. We can implement this using this component as an example in a separate PR.

# end of TODO

# add id and source columns
df["id"] = df.assign(id=1).id.cumsum()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should use the LAION ids here.

Copy link
Contributor Author

@NielsRogge NielsRogge May 11, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can I leave this for a follow-up PR? Would like to first get all components up and running.

There's no real benefit of having the LAION id's over integer indices that go from 0 to length of the dataset for this pipeline

@NielsRogge NielsRogge merged commit 0b3c267 into main May 11, 2023
@RobbeSneyders RobbeSneyders deleted the add_image_retrieval branch May 15, 2023 16:28
Hakimovich99 pushed a commit that referenced this pull request Oct 16, 2023
This PR adds the image retrieval component, which retrieves image URLs
using clip-retrieval from LAION-5B.

Fixes #95

---------

Co-authored-by: Niels Rogge <nielsrogge@Nielss-MacBook-Pro.local>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Prompt-based LAION retrieval component
4 participants