CogAgent #9
I implemented CogAgent and debugged it until inference was running, but it turns out CogAgent can't do inference on more than one image at a time because of its cross-attention module, which effectively means it can't ingest videos. One potential workaround is to go frame by frame: ask it to label the current frame while conditioning on the previous autoregressive labels (rough sketch below). The downside is that it then can't do any in-context learning, since it can't attend over past (screenshot, label) pairs in the context. We could switch to CogVLM, which should handle multiple images since it doesn't have the cross-attention module, but it hasn't been fine-tuned / trained on software data (only CogAgent was), so it would probably do very poorly, especially at 400x400 image resolution. Let's discuss next week, but I think we may need to rethink whether we want to keep this baseline.
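As a rough illustration of the frame-by-frame workaround (this is a sketch, not CogAgent's actual API; `load_frames` and `run_cogagent` are hypothetical placeholders for our video loader and single-image inference call):

```python
# Sketch: label each frame while conditioning only on the current frame
# plus previously generated labels. The model never sees more than one
# image, so past frames are summarized purely through their text labels
# and there is no (screenshot, label) in-context learning.
from typing import List
from PIL import Image


def load_frames(video_path: str) -> List[Image.Image]:
    raise NotImplementedError  # placeholder: decode the video into PIL frames


def run_cogagent(image: Image.Image, prompt: str) -> str:
    raise NotImplementedError  # placeholder: single-image CogAgent inference


def label_video(video_path: str, task_prompt: str) -> List[str]:
    labels: List[str] = []
    for i, frame in enumerate(load_frames(video_path)):
        history = "\n".join(f"Frame {j}: {l}" for j, l in enumerate(labels))
        prompt = f"{task_prompt}\nPrevious frame labels:\n{history}\nLabel frame {i}:"
        labels.append(run_cogagent(frame, prompt))
    return labels
```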
I updated the branch to use CogVLM2-Video. It seems to run, but the output is bad: it just copies the SOP from the prompt instead of attending to the screenshots. We probably need to try different inference parameters (temperature / top_p), and we may also need to change the prompt since the model isn't as capable. Another issue is that CogVLM2-Video's inference code expects a single video, i.e. one big array of shape [Channel, Timestep, Width, Height]. That breaks our ICL setup, which interleaves the video tokens with their labels, on the assumption that the model can use the positional encoding to associate each video frame with its label. In CogVLM2-Video's inference code, basically all the text tokens get grouped together and all the image tokens get grouped together (see the sketch below). Still, we should at least be able to use it for zero-shot evaluation if we tune the prompt and inference settings a bit.
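A small illustration of the mismatch (not the model's actual preprocessing; the resolution, frame count, and labels are placeholder values):

```python
# Why the interleaved ICL format doesn't fit CogVLM2-Video's expected input.
import torch

C, H, W = 3, 224, 224  # placeholder resolution
frames = [torch.rand(C, H, W) for _ in range(3)]
labels = ["click login", "type username", "press submit"]

# What CogVLM2-Video's inference code expects: ONE video tensor with all
# frames stacked along a timestep axis. All image tokens end up in a single
# block, separate from all the text tokens.
video = torch.stack(frames, dim=1)  # shape [C, T, H, W] = [3, 3, 224, 224]

# What our ICL prompt needs: frames interleaved with their labels, so the
# model can line up each (screenshot, label) pair positionally.
interleaved = []
for frame, label in zip(frames, labels):
    interleaved.append(("image", frame))
    interleaved.append(("text", f"Label: {label}"))
```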