
Training Data Pipeline #10

Open
lixu6-alt opened this issue Oct 30, 2024 · 4 comments

Comments

@lixu6-alt

Hi authors:

Thanks for your impressive work! I am currently working on an idea of using text instructions to guide the fusion of visual tokens, but I am confused about how to process the multi-turn conversations in the training set (like the samples in the LLaVA-665K dataset). In the multi-turn case, you have multiple text instructions to deal with. I noticed that you already use a text-guided router in the model architecture, so I am wondering how you deal with this issue. Do I have to split each multi-turn sample into multiple single-turn samples, or is there a more efficient way to work it out?

Thanks!

@yfzhang114
Owner

You only need to mask all the answers and system prompts, reserving all the questions as the query text. I think splitting each multi-turn sample into multiple single-turn samples is inefficient. An intuitive example: if the user asks two questions about the image, we need to preserve the image details related to all of the questions.
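A minimal sketch of this masking scheme, assuming a LLaVA-style `conversations` list of `from`/`value` turns (`build_query_text` is a hypothetical helper for illustration, not a function from this repo):

```python
def build_query_text(conversations):
    """Keep all human questions as the query text; answers ("gpt" turns)
    and any system prompt are masked out, i.e. simply not included."""
    questions = [
        turn["value"].replace("<image>", "").strip()
        for turn in conversations
        if turn["from"] == "human"  # reserve questions, drop answers
    ]
    return " ".join(questions)

sample = {
    "conversations": [
        {"from": "human", "value": "<image>\nWhat is the man holding?"},
        {"from": "gpt", "value": "He is holding a surfboard."},
        {"from": "human", "value": "What color is it?"},
        {"from": "gpt", "value": "It is yellow."},
    ]
}
print(build_query_text(sample["conversations"]))
# What is the man holding? What color is it?
```

This way one multi-turn sample stays one training sample, and the router sees every question about the image at once.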

@lixu6-alt
Author

@yfzhang114 Thanks for your timely response!! I get your point, but I am concerned about whether the text encoder can effectively extract the semantic information from multiple questions at once. For example, by checking the conversation contents in LLaVA-665K, I noticed that most of the questions do not relate to each other even though they correspond to the same image. Currently, I have split the multi-turn samples of LLaVA-665K, so the sample count has grown from 665K to over 3200K, which makes training very inefficient. Still looking for a better solution :)
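For reference, the splitting I did looks roughly like this (a sketch over the LLaVA-style `conversations` format; `split_to_single_turn` is my own hypothetical helper). Each sample with k question/answer pairs becomes k single-turn samples, which is where the 665K-to-3200K blowup comes from:

```python
def split_to_single_turn(sample):
    """Expand one multi-turn sample into one sample per Q/A pair."""
    convs = sample["conversations"]
    out = []
    for i in range(0, len(convs) - 1, 2):
        q, a = convs[i], convs[i + 1]
        if q["from"] != "human" or a["from"] != "gpt":
            continue  # skip malformed pairs
        out.append({**sample, "conversations": [q, a]})
    return out

sample = {
    "id": "example",
    "conversations": [
        {"from": "human", "value": "<image>\nWhat is the man holding?"},
        {"from": "gpt", "value": "He is holding a surfboard."},
        {"from": "human", "value": "What color is it?"},
        {"from": "gpt", "value": "It is yellow."},
    ],
}
singles = split_to_single_turn(sample)
print(len(singles))  # 2: one single-turn sample per Q/A pair
```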

@yfzhang114
Owner

The text encoder is expected to handle multiple questions, since each question essentially consists of keywords, and the encoder can accommodate a variety of keywords. However, since this is still an experimental area, further testing and experimentation are necessary to validate these capabilities.

@lixu6-alt
Author

@yfzhang114 Yeah, maybe pre-processing the instructions is necessary in this case. As you said, maybe we can use an LLM to extract the keywords from each instruction, then integrate these keywords and feed them to the text encoder just once.
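A toy sketch of what I mean, with a trivial stopword filter standing in for the LLM keyword extractor (`extract_keywords` and `merge_keywords` are hypothetical names; in practice the extraction step would be an LLM call):

```python
# Toy stand-in for LLM keyword extraction: drop common function words.
STOPWORDS = {"what", "is", "the", "a", "an", "of", "in", "it", "this",
             "that", "how", "many", "are", "there", "please", "describe"}

def extract_keywords(instruction):
    """Extract keywords from one instruction (LLM call in practice)."""
    words = instruction.lower().strip("?.!").split()
    return [w for w in words if w not in STOPWORDS]

def merge_keywords(instructions):
    """Merge keywords from all questions into one string, so the text
    encoder only needs to run once per multi-turn sample."""
    seen, merged = set(), []
    for inst in instructions:
        for kw in extract_keywords(inst):
            if kw not in seen:  # deduplicate across questions
                seen.add(kw)
                merged.append(kw)
    return " ".join(merged)

questions = ["What is the man holding?", "How many dogs are there?"]
print(merge_keywords(questions))  # man holding dogs
```

This keeps a single encoder pass per image while still representing every question, at the cost of an extra pre-processing step over the dataset.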
