Vision/Multimodal #791

bhack · 2024-04-18T12:56:07Z

With all the growing activity and focus on multimodal models is this library restricted to tune text only LLM?
Do we plan to have Vision or more in general multimodal models tuning support?

ebsmothers · 2024-04-18T13:42:05Z

Hi @bhack thanks for the question! We haven't added any multimodal models yet as we are working to get good coverage of text-only methods first, but it's definitely something we are considering for the future. Out of curiosity, are there any multimodal models or techniques you'd be interested in seeing specifically?

bhack · 2024-04-18T14:10:27Z

There was a recent and interesting survey at:
https://github.com/UbiquitousLearning/Efficient_Foundation_Model_Survey

bhack · 2024-04-18T14:19:06Z

About my personal preference I would like to effectively fine-tuning with a support library like this one models like (or similar):
https://github.com/FoundationVision/GLEE

matbee-eth · 2024-04-23T06:02:43Z

Moondream2 would be a good one to begin with! It uses Phi-2 and forked siglip as a projector

ebsmothers · 2024-05-01T03:34:30Z

Thanks @bhack and @matbee-eth for the suggestions! At this exact moment we do not have the bandwidth to take these on but we will keep them both in mind for the near future (re Moondream2 we are currently working on adding Phi-3). In the meantime, please let us know if either of you would be willing to contribute on this front.

matbee-eth · 2024-05-01T18:32:23Z

Thanks @bhack and @matbee-eth for the suggestions! At this exact moment we do not have the bandwidth to take these on but we will keep them both in mind for the near future (re Moondream2 we are currently working on adding Phi-3). In the meantime, please let us know if either of you would be willing to contribute on this front.

Do you know of any PR's that cover end-to-end implementation details for doing such a thing? Just to assess whether it takes novel-work or if its just conforming to some sort of protocol/design.

bhack · 2024-07-30T17:15:15Z

Also fine-tuning other Meta foundational models would be nice like recent https://github.com/facebookresearch/segment-anything-2

joecummings · 2024-07-30T18:48:03Z

Also fine-tuning other Meta foundational models would be nice like recent facebookresearch/segment-anything-2

Thanks for the input! We're still a small team so working hard to provide great memory savings and performance for LLMs first, but this is 100% on our radar.

Just out of curiosity - what kind of finetuning would you want to do with SAM2? Do you have any hard data or HW constraints?

bhack · 2024-07-30T18:59:31Z

Yes generally it could be hard mining cases, highres, HW constrains etc.. So I think that also in Vision we really have the same type of fine-tuning needs.
I really hope that we could share some common infra/components between LLM, Vision and Multimodal without building 3 different frameworks.
But this will really depend on how well torctune could abstract some concepts.
Also if I understand you are prioritizing LLM I think that a Multimodal/Vision model could be useful as an early Canary test to lowering the risk of a more heavy refractory on a later stage.
I think that you can ask the collaboration of some Vision/multimodal teams internally to create more critical mass around the project.

bhack · 2024-07-30T19:02:54Z

E.g. see how many comments we had on the original SAM just related to fine-tuning:
facebookresearch/segment-anything#5

bhack · 2024-07-30T19:07:43Z

Also just to make another example. Your WIP RLHF with PPO #1005 or other approaches like that could be still useful in Vision/Multimodal https://encord.com/blog/guide-to-rlhf/

So I think this is why it is important to have some canary tests on other domains to better validate the design.

bhack · 2024-11-06T15:02:14Z

@RdoubleA Where we could track the Vision part?

RdoubleA · 2024-11-06T15:52:49Z

We currently support multimodal models (vision + text), you can take a look at Llama3.2 Vision: https://github.com/pytorch/torchtune/blob/main/torchtune/models/llama3_2_vision/_model_builders.py which uses CLIP image encoders.

The components used here can be used to add future multimodal models, especially for DeepFusion (https://github.com/pytorch/torchtune/blob/main/torchtune/modules/model_fusion/_fusion.py). I am currently working on enabling EarlyFusion here: #1904. This would support models like Llava.

Is there any model or feature you are particularly interested in?

bhack · 2024-11-10T23:07:13Z

SAM2 now has official training code from META. Do you think we could have a pure vision model like this?

felipemello1 · 2024-11-11T13:56:22Z

Hey @bhack , thanks for asking! Currently we don't have it in your roadmap :/. Our main focus is LLM, including multimodality. It is possible that, as we build for multimodality some vision models will become natively supported. But SAM2 is not in our plans just yet.

bhack · 2024-11-13T17:08:50Z

Ok any visual only model could be fine if you could include it in the roadmap.

anilbatra2185 · 2024-12-10T15:45:51Z

hi @RdoubleA,

thanks for including the clip model and LLama3.2-Vision models. I wonder if there is any plan to include SigLip model in near future, as I intend to work on it.

thanks

pbontrager · 2024-12-10T16:27:59Z

Thanks fro your interest in our vision capabilities! SigLIP is an important model for us to onboard as it's the primary alternative to CLIP for popular VLMs currently. We will likely add SigLIP during the upcoming months but it's not top priority as there are a number of VLMs we can onboard with CLIP first.

If SigLIP is very important for you we can take that into consideration and we would also welcome you to contribute SigLIP to the codebase if you're able and willing too. It would look fairly similar to the CLIP image model code.

anilbatra2185 · 2024-12-10T16:44:45Z

Thanks fro your interest in our vision capabilities! SigLIP is an important model for us to onboard as it's the primary alternative to CLIP for popular VLMs currently. We will likely add SigLIP during the upcoming months but it's not top priority as there are a number of VLMs we can onboard with CLIP first.

If SigLIP is very important for you we can take that into consideration and we would also welcome you to contribute SigLIP to the codebase if you're able and willing too. It would look fairly similar to the CLIP image model code.

thank you @pbontrager for confirming. I am happy to work on SigLIP model and will let you know once it is completed.

pbontrager · 2024-12-10T17:10:26Z

Thank you that sounds great! Once you're ready to start on it could you share which siglip version you're building? If you're doing just the vision portion of the model, you should be able to implement model/siglip with just _transform.py, _component_builders.py, _convert_weights.py, and _model_builders.py files using the clip folder a a blueprint.

felipemello1 · 2024-12-10T18:45:53Z

@anilbatra2185 , thanks for looking into this! To avoid rewriting code, you may want to write a small RFC (request for comments) PR, with a basic skeleton of the classes and dummy code (not functional). For example, you may need to change the current transformer API and discussing it before writing your whole code may save you some time.

Example:

path/to/folder_a:
  |- my_function.py
  	class_a
  	class_b
 |- my_other_function.py
     util_1
     util_2

Class A:
	def method1(arg1, new_arg_2):
		...
	def method 2
		...

anilbatra2185 · 2024-12-10T18:53:19Z

thanks @pbontrager for sharing the information on basic files need to change. I am interested to work on google/siglip-base-patch16-256 and google/siglip-base-patch16-256-multilingual models. However, I will start with basic model (earlier one).

thanks @felipemello1 for suggestion and it is helpful to plan before implementation. I will share the RFC soon.

RdoubleA mentioned this issue Aug 21, 2024

Tracker for Multimodal requests #1383

Closed

RdoubleA closed this as completed Nov 6, 2024

RdoubleA reopened this Nov 6, 2024

joecummings added the enhancement New feature or request label Dec 10, 2024

joecummings assigned pbontrager Dec 10, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Vision/Multimodal #791

Vision/Multimodal #791

bhack commented Apr 18, 2024

ebsmothers commented Apr 18, 2024

bhack commented Apr 18, 2024

bhack commented Apr 18, 2024

matbee-eth commented Apr 23, 2024 •

edited

Loading

ebsmothers commented May 1, 2024

matbee-eth commented May 1, 2024

bhack commented Jul 30, 2024

joecummings commented Jul 30, 2024

bhack commented Jul 30, 2024 •

edited

Loading

bhack commented Jul 30, 2024

bhack commented Jul 30, 2024 •

edited

Loading

bhack commented Nov 6, 2024

RdoubleA commented Nov 6, 2024

bhack commented Nov 10, 2024

felipemello1 commented Nov 11, 2024 •

edited

Loading

bhack commented Nov 13, 2024

anilbatra2185 commented Dec 10, 2024

pbontrager commented Dec 10, 2024

anilbatra2185 commented Dec 10, 2024

pbontrager commented Dec 10, 2024

felipemello1 commented Dec 10, 2024 •

edited

Loading

anilbatra2185 commented Dec 10, 2024

Vision/Multimodal #791

Vision/Multimodal #791

Comments

bhack commented Apr 18, 2024

ebsmothers commented Apr 18, 2024

bhack commented Apr 18, 2024

bhack commented Apr 18, 2024

matbee-eth commented Apr 23, 2024 • edited Loading

ebsmothers commented May 1, 2024

matbee-eth commented May 1, 2024

bhack commented Jul 30, 2024

joecummings commented Jul 30, 2024

bhack commented Jul 30, 2024 • edited Loading

bhack commented Jul 30, 2024

bhack commented Jul 30, 2024 • edited Loading

bhack commented Nov 6, 2024

RdoubleA commented Nov 6, 2024

bhack commented Nov 10, 2024

felipemello1 commented Nov 11, 2024 • edited Loading

bhack commented Nov 13, 2024

anilbatra2185 commented Dec 10, 2024

pbontrager commented Dec 10, 2024

anilbatra2185 commented Dec 10, 2024

pbontrager commented Dec 10, 2024

felipemello1 commented Dec 10, 2024 • edited Loading

anilbatra2185 commented Dec 10, 2024

matbee-eth commented Apr 23, 2024 •

edited

Loading

bhack commented Jul 30, 2024 •

edited

Loading

bhack commented Jul 30, 2024 •

edited

Loading

felipemello1 commented Nov 11, 2024 •

edited

Loading

felipemello1 commented Dec 10, 2024 •

edited

Loading