Training Data Pipeline #10
Comments
You only need to mask all the answers and system prompts, reserving all the questions as the query text. I think splitting each multi-turn sample into multiple single-turn samples is inefficient. An intuitive example: if the user asked two questions about the image, we need to preserve the image details related to all of the questions.
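Roughly, something like this (just an illustrative sketch, not the actual pipeline code; it assumes the LLaVA-style conversation layout with `from`/`value` fields):

```python
def build_query_text(conversation):
    """Keep only the human questions as query text; drop system prompts and gpt answers.

    conversation: list of turns like {"from": "human"|"gpt", "value": "..."}.
    """
    questions = [
        turn["value"].replace("<image>", "").strip()
        for turn in conversation
        if turn.get("from") == "human"
    ]
    # All questions of the multi-turn sample form one query text,
    # so the sample does not need to be split into single-turn samples.
    return " ".join(questions)
```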
@yfzhang114 Thanks for your timely response!! I get your point, but I am concerned about whether the text encoder can effectively extract the semantic information from multiple questions at once. For example, by checking the conversation contents in LLaVA-665K, I noticed that most of the questions do not relate to each other even though they correspond to the same image. Currently, I have split the multi-turn samples of LLaVA-665K, so the sample count has grown from 665K to over 3200K, which makes training very inefficient. Still looking for a better solution :)
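For example, the splitting could look like this (just a sketch, assuming each LLaVA-665K record has an `image` field and a `conversations` list of alternating human/gpt turns):

```python
def split_multi_turn(record):
    """Turn one multi-turn record into several single-turn records."""
    convs = record["conversations"]
    singles = []
    for i in range(0, len(convs) - 1, 2):
        if convs[i]["from"] == "human" and convs[i + 1]["from"] == "gpt":
            singles.append({
                "image": record.get("image"),
                "conversations": [convs[i], convs[i + 1]],
            })
    # Applied over the whole dataset, this is what inflates 665K records
    # to more than 3200K samples.
    return singles
```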
The text encoder is expected to handle multiple questions, with each question essentially consisting of keywords, and it can accommodate a variety of keywords. However, since this is still an experimental area, further testing and experimentation are necessary to validate these capabilities.
@yfzhang114 Yeah, maybe pre-processing the instructions is necessary in this case. As you said, we could use an LLM to extract the keywords from each instruction, then integrate these keywords and feed them to the text encoder just once.
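As a rough sketch of that idea (the `llm` and `text_encoder` callables below are placeholders, not any specific API):

```python
def extract_keywords_with_llm(question, llm):
    """Hypothetical helper: ask an LLM for comma-separated keywords of one question."""
    prompt = (
        "Extract the key visual concepts from this question as "
        f"comma-separated keywords:\n{question}"
    )
    return [kw.strip() for kw in llm(prompt).split(",") if kw.strip()]

def encode_multi_turn_instructions(questions, llm, text_encoder):
    """Merge keywords from all questions and run the text encoder once."""
    keywords = []
    for q in questions:
        keywords.extend(extract_keywords_with_llm(q, llm))
    merged = ", ".join(dict.fromkeys(keywords))  # de-duplicate, keep order
    return text_encoder(merged)
```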
Hi authors:
Thanks for your impressive work! I am currently working on an idea of using the text instruction to guide the fusion of visual tokens, but I am confused about how to process the multi-turn conversations in the training set (like the samples in the LLaVA-665K dataset). In the multi-turn case, there are multiple text instructions to deal with. I noticed that you already use a text-guided router in the model architecture, so I am wondering how you handle this issue. Do I have to split each multi-turn sample into multiple single-turn samples, or is there a more efficient way to work it out?
Thanks!