Training Data Pipeline #10
Comments
You only need to mask all the answers and system prompts, reserving all the questions as the query text. I think splitting each multi-turn sample into multiple single-turn samples is inefficient. An intuitive example: if the user asked two questions about the image, we need to preserve the image details related to all of the questions.
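Roughly, something like this (just an illustrative sketch, not the actual pipeline code; it assumes the LLaVA-style conversation layout with `from`/`value` fields):

```python
def build_query_text(conversation):
    """Keep only the human questions as query text; drop system prompts and gpt answers.

    conversation: list of turns like {"from": "human"|"gpt", "value": "..."}.
    """
    questions = [
        turn["value"].replace("<image>", "").strip()
        for turn in conversation
        if turn.get("from") == "human"
    ]
    # All questions of the multi-turn sample form one query text,
    # so the sample does not need to be split into single-turn samples.
    return " ".join(questions)
```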
@yfzhang114 Thanks for your timely response!! I get your point, but I am concerned about whether the text encoder can effectively extract the semantic information from multiple questions at once. For example, by checking the conversation contents in LLaVA-665K, I noticed that most of the questions do not relate to each other even though they correspond to the same image. Currently, I have split the multi-turn samples of LLaVA-665K, so the sample count has grown from 665K to over 3200K, which makes training very inefficient. Still looking for a better solution :)
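For example, the splitting could look like this (just a sketch, assuming each LLaVA-665K record has an `image` field and a `conversations` list of alternating human/gpt turns):

```python
def split_multi_turn(record):
    """Turn one multi-turn record into several single-turn records."""
    convs = record["conversations"]
    singles = []
    for i in range(0, len(convs) - 1, 2):
        if convs[i]["from"] == "human" and convs[i + 1]["from"] == "gpt":
            singles.append({
                "image": record.get("image"),
                "conversations": [convs[i], convs[i + 1]],
            })
    # Applied over the whole dataset, this is what inflates 665K records
    # to more than 3200K samples.
    return singles
```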
The text encoder is expected to handle multiple questions, with each question essentially consisting of keywords, and it can accommodate a variety of keywords. However, since this is still an experimental area, further testing and experimentation are necessary to validate these capabilities.
@yfzhang114 Yeah, maybe pre-processing the instructions is necessary in this case. As you said, we could use an LLM to extract the keywords from each instruction, then integrate these keywords and feed them to the text encoder just once.
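As a rough sketch of that idea (the `llm` and `text_encoder` callables below are placeholders, not any specific API):

```python
def extract_keywords_with_llm(question, llm):
    """Hypothetical helper: ask an LLM for comma-separated keywords of one question."""
    prompt = (
        "Extract the key visual concepts from this question as "
        f"comma-separated keywords:\n{question}"
    )
    return [kw.strip() for kw in llm(prompt).split(",") if kw.strip()]

def encode_multi_turn_instructions(questions, llm, text_encoder):
    """Merge keywords from all questions and run the text encoder once."""
    keywords = []
    for q in questions:
        keywords.extend(extract_keywords_with_llm(q, llm))
    merged = ", ".join(dict.fromkeys(keywords))  # de-duplicate, keep order
    return text_encoder(merged)
```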
Hi authors:
Thanks for your impressive work! I am currently working on an idea of using the text instruction to guide the fusion of visual tokens, but I am confused about how to process the multi-turn conversations in the training set (like the samples in the LLaVA-665K dataset). In the multi-turn case, there are multiple text instructions to deal with. I noticed that you already use a text-guided router in the model architecture, so I am wondering how you handle this issue. Do I have to split each multi-turn sample into multiple single-turn samples, or is there a more efficient way to work it out?
Thanks!