Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[DataComp] Add T-MARS #374

Merged
merged 54 commits into from
Sep 7, 2023
Merged

[DataComp] Add T-MARS #374

merged 54 commits into from
Sep 7, 2023

Conversation

NielsRogge
Copy link
Contributor

@NielsRogge NielsRogge commented Aug 22, 2023

This PR adds the T-MARS pipeline to the Datacomp folder. Initial discussion at locuslab/T-MARS#3.

Screenshot 2023-08-31 at 11 11 19

To do:

  • remove commit hashes and replace with dev tag before merging

@NielsRogge NielsRogge changed the title [DataComp] Update simple pipeline [DataComp] Update pipelines Aug 22, 2023
@NielsRogge NielsRogge force-pushed the fix_simple_pipeline branch 3 times, most recently from a5dd774 to 71f705a Compare August 23, 2023 11:54
@RobbeSneyders
Copy link
Member

Do all the model and config files need to be included in the detect_text component?

@NielsRogge NielsRogge force-pushed the fix_simple_pipeline branch 2 times, most recently from 647c9ab to 131d27b Compare August 24, 2023 11:18
@NielsRogge NielsRogge changed the title [DataComp] Update pipelines [DataComp] Add text detection Aug 24, 2023
@NielsRogge NielsRogge mentioned this pull request Aug 24, 2023
5 tasks
@NielsRogge NielsRogge force-pushed the fix_simple_pipeline branch 2 times, most recently from 188562e to 76a0172 Compare August 24, 2023 13:58
Copy link
Contributor

@PhilippeMoussalli PhilippeMoussalli left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @NielsRogge!

Looks good overall but I think there are many added files for the text detection component that can be removed (will make review easier). Maybe let's stick to one model for now and only keeps the configs necessary for that one

scripts/build_components.sh Outdated Show resolved Hide resolved
scripts/build_components.sh Outdated Show resolved Hide resolved
@@ -32,6 +32,13 @@ produces:
clipl14score:
type: float32

textembedding:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wondering whether it won't be more intuitive to have the embeddings as a field rather than a subset. It makes more sense since different modalities (subsets) can have different embeddings, Wdyt?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes I know, however image data and image embeddings are both heavy in terms of data, which is probably why the subsets were created in the first place. So if both image data and embeddings are part of the same subset, then Fondant will first always load both first, and then only provide the embeddings if the user specified this, right?

cc @RobbeSneyders

@NielsRogge NielsRogge changed the title [DataComp] Add text detection [DataComp] Add T-MARS Aug 31, 2023
@NielsRogge NielsRogge force-pushed the fix_simple_pipeline branch 2 times, most recently from ae6df4a to 7f56da0 Compare September 1, 2023 13:58
Copy link
Contributor

@PhilippeMoussalli PhilippeMoussalli left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks Niels!

Left a few additional comments. Can you please revert back the changes to the Dockerfiles/images? Otherwise the CI/CD pipelines will fail to update them


# Install Fondant
# This is split from other requirements to leverage caching
ARG FONDANT_VERSION=6f8d908c4231785afab8c2b3e87f66f9abf0452b
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

revert to main


# Install Fondant
# This is split from other requirements to leverage caching
ARG FONDANT_VERSION=6f8d908c4231785afab8c2b3e87f66f9abf0452b
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same here

node_pool_label="node_pool",
node_pool_name="model-inference-mega-pool",
number_of_gpus=1,
cache=False,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You can remove cache=False now

scripts/build_components.sh Outdated Show resolved Hide resolved
@@ -16,8 +19,8 @@
pipeline_name="datacomp-filtering-pipeline",
pipeline_description="A pipeline for filtering the Datacomp dataset",
base_path=PipelineConfigs.BASE_PATH,
# base_path="/Users/nielsrogge/Documents/fondant_artifacts_datacomp",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This can be removed

@PhilippeMoussalli PhilippeMoussalli force-pushed the fix_simple_pipeline branch 2 times, most recently from 2fcc090 to 2b6e2c5 Compare September 7, 2023 08:52
@PhilippeMoussalli PhilippeMoussalli merged commit 82b5131 into main Sep 7, 2023
4 checks passed
@PhilippeMoussalli PhilippeMoussalli deleted the fix_simple_pipeline branch September 7, 2023 08:55
Hakimovich99 pushed a commit that referenced this pull request Oct 16, 2023
This PR adds the [T-MARS](https://tmars-clip.github.io/) pipeline to the
Datacomp folder. Initial discussion at
locuslab/T-MARS#3.

<img width="386" alt="Screenshot 2023-08-31 at 11 11 19"
src="https://github.com/ml6team/fondant/assets/48327001/a697121d-9851-4445-94ac-19b257719344">

To do:
- remove commit hashes and replace with dev tag before merging

---------

Co-authored-by: Philippe Moussalli <philippe.moussalli95@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants