
CogAgent #9

Open
mcx-agile-loop opened this issue Aug 15, 2024 · 2 comments
mcx-agile-loop (Collaborator)

No description provided.

rainx0r self-assigned this Aug 15, 2024

rainx0r (Collaborator) commented Aug 18, 2024

So I implemented CogAgent and debugged it until inference was running, but it turns out that CogAgent can't do inference with more than one image at a time because of its cross-attention module, which basically means it can't ingest videos.

One potential way around this is to go frame by frame and have it label each frame, conditioning on the current video frame plus its previous autoregressive labels. The downside is that it then won't do any in-context learning, since it can't attend over past (screenshot, label) pairs in the context. A rough sketch of that loop is below.
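A minimal sketch of that frame-by-frame fallback, assuming a single-image inference callable is passed in (the `single_image_infer` helper and the prompt format are placeholders, not part of the current branch):

```python
# Hypothetical sketch of the frame-by-frame fallback: CogAgent only sees one
# image per forward pass, so past frames are carried forward as text labels only.
def label_video_frame_by_frame(frames, task_prompt, single_image_infer):
    """frames: list of screenshot images.
    single_image_infer(prompt, image) -> str is a placeholder for a CogAgent
    single-image inference call."""
    labels = []
    history = ""  # previous autoregressive labels, text only
    for t, frame in enumerate(frames):
        prompt = (
            f"{task_prompt}\n"
            f"Previous step labels:\n{history}\n"
            f"Label the current screenshot (step {t}):"
        )
        label = single_image_infer(prompt, frame)
        labels.append(label)
        history += f"step {t}: {label}\n"
    return labels
```

Note there is no in-context learning here: only the text history is visible at step t, never the earlier screenshots themselves.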

We could switch to CogVLM, which should be able to handle multiple images since it doesn't have the cross-attention module, but it hasn't been fine-tuned / trained on software data (only CogAgent was), so it would probably do very poorly, especially at a 400x400 image resolution.

Let's discuss next week, but I think we might need to rethink whether or not we want to use this baseline.

rainx0r (Collaborator) commented Aug 20, 2024

I updated the branch to use CogVLM2-Video. It seems to work, but the output is bad: it just copies the SOP from the prompt instead of attending to the screenshots. We probably need to try different inference parameters (temperature / top_p), but we might also need to change the prompt because the model isn't as smart.
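For reference, a minimal sketch of the kind of sampling sweep I mean, using standard HuggingFace `generate` arguments (the values are placeholders to try, not tuned settings, and `model` / `tokenizer` / `inputs` are assumed to already be set up for CogVLM2-Video):

```python
# Placeholder sampling settings to sweep over; greedy decoding tends to
# copy the prompt verbatim, so enable sampling and vary temperature / top_p.
gen_kwargs = dict(
    max_new_tokens=512,
    do_sample=True,
    temperature=0.7,   # lower -> more conservative, higher -> more diverse
    top_p=0.9,         # nucleus sampling cutoff
)
outputs = model.generate(**inputs, **gen_kwargs)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```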

Another issue is that the inference code for CogVLM2-Video expects a single video (one big array of shape [Channel, Timestep, Width, Height]). That breaks our ICL setup, which interleaves the video tokens with their labels on the assumption that the model can use the positional encoding to associate each video frame with its label. In CogVLM2-Video's inference code, basically all the text tokens get grouped together and all the image tokens get grouped together, if that makes sense.
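To illustrate the mismatch (shapes only, not CogVLM2-Video's actual preprocessing code, and the labels are placeholders):

```python
import torch

# What the inference code expects: one video tensor of shape
# [Channel, Timestep, Width, Height], i.e. all frames pooled into a single clip.
frames = [torch.rand(3, 400, 400) for _ in range(8)]  # 8 screenshots, C x W x H
video = torch.stack(frames, dim=1)                     # shape [3, 8, 400, 400]

# What our ICL format assumes: frames interleaved with their labels, so the model
# could in principle use positional encoding to tie each frame to its label.
labels = [f"label for step {t}" for t in range(len(frames))]
interleaved = [item for pair in zip(frames, labels) for item in pair]
# -> [frame_0, "label for step 0", frame_1, "label for step 1", ...]
```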

Nevertheless, we should at least be able to use it for zero-shot evaluation if we tune the prompt and inference settings a bit.
